CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.

Today
Multicontext
– Review why
– Cost
– Packing into contexts
– Retiming implications

Review
Single context: What happens when
– Cycles in flowgraph
  – abundant data-level parallelism
  – no data-level parallelism
– Low-throughput tasks
  – structured (e.g. datapaths)
  – unstructured
– Data-dependent operations
  – similar ops
  – dissimilar ops

What Happens?

Single Context
When we have:
– cycles and no data parallelism
– low-throughput, unstructured tasks
– dissimilar data-dependent tasks
Active resources sit idle most of the time
– Waste of resources
Why?

Single Context: Why?
Cannot reuse resources to perform a different function, only the same one

Resource Reuse
To use resources in these cases
– must direct them to do different things.
Must be able to tell resources how to behave
=> separate instructions (pinsts) for each behavior

Example: Serial Evaluation

Example: Dissimilar Operations

Multicontext Organization/Area
A_ctxt ≈ 80Kλ²
– dense encoding
A_base ≈ 800Kλ²
A_ctxt : A_base = 1:10

Example: DPGA Prototype

Example: DPGA Area

Multicontext Tradeoff Curves
Assume ideal packing: N_active = N_total / L
Reminder: robust point: c × A_ctxt = A_base
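The tradeoff can be explored numerically. The following is a minimal sketch, not the model behind the lecture's curves: it assumes total area is N_active × (A_base + c × A_ctxt), uses the A_base and A_ctxt figures quoted earlier, and assumes that contexts beyond the path length L give no further reduction in active LUTs. The function name and its parameters are illustrative.

```python
import math

A_BASE = 800_000  # active LUT plus interconnect, in lambda^2 (figure from the slides)
A_CTXT = 80_000   # one stored context (instruction), in lambda^2 (figure from the slides)

def multicontext_area(n_total, path_len, contexts, a_base=A_BASE, a_ctxt=A_CTXT):
    """Area in lambda^2 under ideal packing (assumed model, see lead-in).

    With c contexts at most c LUT evaluations share one active LUT, and
    (assumption) sharing beyond the path length L brings no extra benefit.
    """
    usable = min(contexts, path_len)
    n_active = math.ceil(n_total / usable)
    return n_active * (a_base + contexts * a_ctxt)

if __name__ == "__main__":
    for c in (1, 2, 4, 8, 10, 16):
        print(c, multicontext_area(n_total=100, path_len=10, contexts=c))
```

With these figures the robust point c × A_ctxt = A_base falls at c = 10, where context memory and active area contribute equally to each LUT.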

In Practice
– Scheduling limitations
– Retiming limitations

Scheduling Limitations
N_A (active)
– size of largest stage
Precedence:
– can evaluate a LUT only after its predecessors have been evaluated
– cannot always completely equalize stage requirements

Scheduling
Precedence limits packing freedom
Freedom we do have
– shows up as slack in the network

Scheduling
Computing slack:
– ASAP (As Soon As Possible) schedule
  – propagate depth forward from primary inputs
  – depth = 1 + max input depth
– ALAP (As Late As Possible) schedule
  – propagate distance from outputs back from outputs
  – level = 1 + max output consumption level
– Slack
  – slack = L + 1 − (depth + level) [PI depth = 0, PO level = 0]
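The slack computation above can be sketched as follows. This is an illustrative sketch: the netlist representation (a dict mapping each LUT to the LUTs that feed it, with primary inputs left out) and the function name are assumptions, and a LUT with no LUT fanout is assumed to drive a primary output.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def compute_slack(fanin):
    """Return (depth, level, slack, L) for a LUT network.

    fanin maps each LUT name to the list of LUTs feeding it. Primary inputs
    are omitted (PI depth = 0); sink LUTs are assumed to feed primary
    outputs (PO level = 0).
    """
    order = list(TopologicalSorter(fanin).static_order())  # predecessors first
    fanout = {n: [] for n in order}
    for n in order:
        for p in fanin.get(n, []):
            fanout[p].append(n)

    depth = {}
    for n in order:  # ASAP: propagate depth forward from the inputs
        depth[n] = 1 + max((depth[p] for p in fanin.get(n, [])), default=0)

    level = {}
    for n in reversed(order):  # ALAP: propagate level back from the outputs
        level[n] = 1 + max((level[s] for s in fanout[n]), default=0)

    L = max(depth.values())  # critical path length in LUT delays
    slack = {n: L + 1 - (depth[n] + level[n]) for n in order}
    return depth, level, slack, L

if __name__ == "__main__":
    # Chain a -> b -> c -> e, plus a short side branch d -> e.
    net = {"a": [], "b": ["a"], "c": ["b"], "d": [], "e": ["c", "d"]}
    depth, level, slack, L = compute_slack(net)
    print(L, slack)
```

In the example network the chain a, b, c, e is critical (slack 0), while d can be evaluated in any of three time slots (slack 2).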

Slack Example

Allowable Schedules Active LUTs (N A ) = 3

Sequentialization
Adding time slots
– more sequential (more latency)
– added slack allows better balance
L = 4 → N_A = 2 (4 or 3 contexts)

Multicontext Scheduling
“Retiming” for multicontext
– goal: minimize peak resource requirements
NP-complete
– list schedule, anneal
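As an illustration of the list-scheduling idea, here is a minimal greedy sketch. It is not the scheduler used for the results in this lecture: the priority rule (most urgent ALAP slot first, then least-loaded slot) and all names are assumptions, and being a heuristic it does not guarantee the minimum peak.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def list_schedule(fanin, num_slots=None):
    """Greedily assign each LUT to a time slot (context), balancing slot loads.

    fanin maps each LUT to the LUTs feeding it (primary inputs omitted).
    Returns (slot, peak): slot[n] is in 1..num_slots, peak is the largest
    number of LUTs evaluated in any one slot.
    """
    order = list(TopologicalSorter(fanin).static_order())
    fanout = {n: [] for n in order}
    for n in order:
        for p in fanin.get(n, []):
            fanout[p].append(n)

    asap = {}
    for n in order:
        asap[n] = 1 + max((asap[p] for p in fanin.get(n, [])), default=0)
    path_len = max(asap.values())
    slots = num_slots or path_len
    if slots < path_len:
        raise ValueError("need at least as many slots as the critical path length")

    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] for s in fanout[n]), default=slots + 1) - 1

    load = [0] * (slots + 1)  # load[0] is unused
    slot = {}
    # ALAP strictly increases along every edge, so sorting by ALAP keeps
    # predecessors ahead of their consumers.
    for n in sorted(order, key=lambda x: (alap[x], -asap[x])):
        earliest = 1 + max((slot[p] for p in fanin.get(n, [])), default=0)
        best = min(range(earliest, alap[n] + 1), key=lambda t: load[t])
        slot[n] = best
        load[best] += 1
    return slot, max(load[1:])

if __name__ == "__main__":
    # Chain a -> b -> c plus three independent LUTs d, e, f.
    net = {"a": [], "b": ["a"], "c": ["b"], "d": [], "e": [], "f": []}
    print(list_schedule(net))               # 3 slots
    print(list_schedule(net, num_slots=6))  # extra slots add slack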

Multicontext Data Retiming
How do we accommodate intermediate data?
Effects?

Signal Retiming
Non-pipelined
– hold value on LUT output (wire) from production through consumption
– wastes wire and switches by occupying them for the entire critical path delay L, not just for the 1/L'th of the cycle it takes to cross a wire segment
– How does this show up in multicontext?

Signal Retiming
Multicontext equivalent
– need a LUT to hold the value for each intermediate context
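The cost of holding values on LUT outputs can be counted directly from a schedule. The sketch below assumes a simple occupancy model that is not spelled out in the slides: a LUT is occupied in the context where it evaluates and in every context strictly between that one and its last consumer. The function and argument names are illustrative.

```python
def active_luts_per_context(slot, fanin, num_contexts):
    """Count occupied LUTs per context, including LUTs that only hold a value.

    slot maps each LUT to the context (1..num_contexts) where it evaluates;
    fanin maps each LUT to the LUTs feeding it.
    """
    last_use = dict(slot)
    for n, preds in fanin.items():
        for p in preds:
            last_use[p] = max(last_use[p], slot[n])

    occupancy = [0] * (num_contexts + 1)  # index 0 is unused
    for n, s in slot.items():
        occupancy[s] += 1                  # context where the LUT evaluates
        for t in range(s + 1, last_use[n]):
            occupancy[t] += 1              # intermediate contexts: hold only
    return occupancy[1:]

if __name__ == "__main__":
    fanin = {"a": [], "b": ["a"], "c": ["b", "d"], "d": []}
    slot = {"a": 1, "b": 2, "c": 3, "d": 1}
    print(active_luts_per_context(slot, fanin, 3))
```

In the example, d is produced in context 1 but not consumed until context 3, so it occupies a LUT in context 2 as well; context 2 needs two LUTs where the schedule alone would suggest one.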

Alternate Retiming
Recall from Day 17
– Net buffer smaller than LUT
– Output retiming
  – may have to route multiple times
– Input buffer chain
  – only need LUT every depth cycles

Input Buffer Retiming
Can only take K unique inputs per cycle
Configuration depth differs from context to context

DES Latency Example
Single output case

ASCII → Hex Example
Single context: … Kλ² = 18.5Mλ²

ASCII → Hex Example
Three contexts: … Kλ² = 12.5Mλ²

ASCII → Hex Example
All retiming on wires (active outputs)
– saturation based on inputs to largest stage
Ideal = perfect scheduling spread + no retime overhead

ASCII → Hex Example
Input depth = 4, c = 6: 5.5Mλ² (compare 18.5Mλ²)

General throughput mapping:
If we only want to achieve limited throughput
– Target: produce a new result every t cycles
– Spatially pipeline every t stages
  – cycle = t
– Retime to minimize register requirements
– Multicontext evaluation within a spatial stage
  – retime (list schedule) to minimize resource usage
– Map for depth (i) and contexts (c)

Benchmark Set
23 MCNC circuits
– area mapped with SIS and Chortle

Multicontext vs. Throughput

Summary (1 of 2)
Several cases cannot profitably reuse the same logic at the device cycle rate
– cycles, no data parallelism
– low throughput, unstructured
– dissimilar data-dependent computations
These cases benefit from more than one instruction/operation per active element
A_ctxt << A_active makes this interesting
– save area by sharing active elements among instructions

Summary (2 of 2)
Economical retiming becomes important here to achieve active LUT reduction
– one output reg/LUT leads to early saturation
c = 4--8, I = 4--6
– automatically mapped designs are 1/2 to 1/3 of single-context size
Most FPGAs typically run in the realm where multicontext is smaller
– How many for intrinsic reasons?
– How many for lack of HSRA-like register/CAD support?