CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #22 – Multi-Context.

Slides:

Advertisements

Similar presentations

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. YuGuy G.F. Lemieux September 15, 2005.

Advertisements

CALTECH CS137 Fall DeHon 1 CS137: Electronic Design Automation Day 19: November 21, 2005 Scheduling Introduction.

EDA (CS286.5b) Day 10 Scheduling (Intro Branch-and-Bound)

Implementation Approaches with FPGAs Compile-time reconfiguration (CTR) CTR is a static implementation strategy where each application consists of one.

ECE 506 Reconfigurable Computing Lecture 6 Clustering Ali Akoglu.

Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.

Graduate Computer Architecture I Lecture 16: FPGA Design.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR SRAM-based FPGA n SRAM-based LE –Registers in logic elements –LUT-based logic element.

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. Yu August 15, 2005.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.

Reconfigurable Computing: What, Why, and Implications for Design Automation André DeHon and John Wawrzynek June 23, 1999 BRASS Project University of California.

Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.

SCORE - Stream Computations Organized for Reconfigurable Execution Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy Andre DeHon, John.

1 A Tree Based Router Search Engine Architecture With Single Port Memories Author: Baboescu, F.Baboescu, F. Tullsen, D.M. Rosu, G. Singh, S. Tullsen, D.M.Rosu,

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.

Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.

 Y. Hu, V. Shih, R. Majumdar and L. He, “Exploiting Symmetries to Speedup SAT-based Boolean Matching for Logic Synthesis of FPGAs”, TCAD  Y. Hu,

CS294-6 Reconfigurable Computing Day 18 October 22, 1998 Control.

CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.

Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #3 – FPGA.

February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.

ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

Power Reduction for FPGA using Multiple Vdd/Vth

Coarse and Fine Grain Programmable Overlay Architectures for FPGAs

LOPASS: A Low Power Architectural Synthesis for FPGAs with Interconnect Estimation and Optimization Harikrishnan K.C. University of Massachusetts Amherst.

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Amalgam: a Reconfigurable Processor for Future Fabrication Processes Nicholas P. Carter University of Illinois at Urbana-Champaign.

J. Christiansen, CERN - EP/MIC

CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #21 – HW/SW.

Lecture 8: Processors, Introduction EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014,

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

A Reconfigurable Low-power High-Performance Matrix Multiplier Architecture With Borrow Parallel Counters Counters : Rong Lin SUNY at Geneseo

Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #21 – HW/SW.

Lecture 13: Logic Emulation October 25, 2004 ECE 697F Reconfigurable Computing Lecture 13 Logic Emulation.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.

Section 1  Quickly identify faulty components  Design new, efficient testing methodologies to offset the complexity of FPGA testing as compared to.

1 Copyright  2001 Pao-Ann Hsiung SW HW Module Outline l Introduction l Unified HW/SW Representations l HW/SW Partitioning Techniques l Integrated HW/SW.

FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.

Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #12 – Systolic.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #13 – Other.

1 Advanced Digital Design Reconfigurable Logic by A. Steininger and M. Delvai Vienna University of Technology.

In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.

1 Hardware-Software Co-Synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs Li Shang and Niraj K.Jha Proceedings.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #23 – Function.

Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.

Chapter One Introduction to Pipelined Processors.

Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.

A Survey of Fault Tolerant Methodologies for FPGA’s Gökhan Kabukcu

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.

Lecture 4: Contrasting Processors: Fixed and Configurable September 20, 2004 ECE 697F Reconfigurable Computing Lecture 4 Contrasting Processors: Fixed.

PipeliningPipelining Computer Architecture (Fall 2006)

Topics SRAM-based FPGA fabrics: Xilinx. Altera..

Parallel Programming By J. H. Wang May 2, 2017.

Instructor: Dr. Phillip Jones

ESE534: Computer Organization

Autonomously Adaptive Computing: Coping with Scalability, Reliability, and Dynamism in Future Generations of Computing Roman Lysecky Department of Electrical.

Pipeline Principle A non-pipelined system of combination circuits (A, B, C) that computation requires total of 300 picoseconds. Comb. logic.

CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing

Presentation transcript:

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #22 – Multi-Context FPGAs

Lect-22.2CprE 583 – Reconfigurable ComputingNovember 6, 2007 process (a, b, c) in port a, b; out port c; { read(a); … write(c); } Specification Line () { a = … … detach } Processor Capture ModelFPGA Partition Synthesize Interface Recap – HW/SW Partitioning Good partitioning mechanism: 1) Minimize communication across bus 2) Allows parallelism  both hardware (FPGA) and processor operating concurrently 3) Near peak processor utilization at all times (performing useful work)

Lect-22.3CprE 583 – Reconfigurable ComputingNovember 6, 2007 Recap – Communication and Control Need to signal between CPU and accelerator Data ready Complete Implementations: Shared memory FIFO Handshake If computation time is very predictable, a simpler communication scheme may be possible

Lect-22.4CprE 583 – Reconfigurable ComputingNovember 6, 2007 Informal Specification, Constraints System model Architecture design HW/SW implementation PrototypeTest Implementation Fail Success Component profiling Performance evaluation Recap – System-Level Methodology

Lect-22.5CprE 583 – Reconfigurable ComputingNovember 6, 2007 Outline Recap Multicontext Motivation Cost analysis Hardware support Examples

Lect-22.6CprE 583 – Reconfigurable ComputingNovember 6, 2007 Single Context When we have Cycles and no data parallelism Low throughput, unstructured tasks Dissimilar data dependent tasks Active resources sit idle most of the time Waste of resources Why?

Lect-22.7CprE 583 – Reconfigurable ComputingNovember 6, 2007 Single Context: Why? Cannot reuse resources to perform different functions, only the same functions

Lect-22.8CprE 583 – Reconfigurable ComputingNovember 6, 2007 Multiple-Context LUT Configuration selects operation of computation unit Context identifier changes over time to allow change in functionality DPGA – Dynamically Programmable Gate Array

Lect-22.9CprE 583 – Reconfigurable ComputingNovember 6, 2007 A B F0F0 F1F1 F2F2 Non-pipelined example Computations that Benefit Low throughput tasks Data dependent operations Effective if not all resources active simultaneously Possible to time-multiplex both logic and routing resources

Lect-22.10CprE 583 – Reconfigurable ComputingNovember 6, 2007 Computations that Benefit (cont.)

Lect-22.11CprE 583 – Reconfigurable ComputingNovember 6, 2007 Computations that Benefit (cont.)

Lect-22.12CprE 583 – Reconfigurable ComputingNovember 6, 2007 Resource Reuse Resources must be directed to do different things at different times through instructions Different local configurations can be thought of as instructions Minimizing the number and size of instructions a key to successfully achieving efficient design What are the implications for the hardware?

Lect-22.13CprE 583 – Reconfigurable ComputingNovember 6, 2007 Example: ASCII – Binary Conversion Input: ASCII Hex character Output: Binary value signal input : std_logic_vector(7 downto 0); signal output : std_logic_vector(3 downto 0); process (input) begin end process;

Lect-22.14CprE 583 – Reconfigurable ComputingNovember 6, 2007 ASCII – Binary Conversion Circuit

Lect-22.15CprE 583 – Reconfigurable ComputingNovember 6, 2007 Implementation #1Implementation #2 N A = 3N A = 4 Implementation Choices Both require same amount of execution time Implementation #1 more resource efficient

Lect-22.16CprE 583 – Reconfigurable ComputingNovember 6, 2007 Interconnect Mux Logic Reuse A ctxt  80K 2 dense encoding A base  800K 2 * Each context not overly costly compared to base cost of wire, switches, IO circuitry Question: How does this effect scale? Previous Study [Deh96B]

Lect-22.17CprE 583 – Reconfigurable ComputingNovember 6, 2007 DPGA Prototype

Lect-22.18CprE 583 – Reconfigurable ComputingNovember 6, 2007 Assume ideal packing: N active = N total /L Reminder: c*A ctxt = A base Difficult to exactly balance resources/demands Needs for contexts may vary across applications Robust point where critical path length equals # contexts Multicontext Tradeoff Curves

Lect-22.19CprE 583 – Reconfigurable ComputingNovember 6, 2007 In Practice Scheduling limitations Retiming limitations

Lect-22.20CprE 583 – Reconfigurable ComputingNovember 6, 2007 Scheduling Limitations N A (active) Size of largest stage Precedence: Can evaluate a LUT only after predecessors have been evaluated Cannot always, completely equalize stage requirements

Lect-22.21CprE 583 – Reconfigurable ComputingNovember 6, 2007 Scheduling Precedence limits packing freedom Freedom shows up as slack in network

Lect-22.22CprE 583 – Reconfigurable ComputingNovember 6, 2007 Scheduling Computing Slack: ASAP (As Soon As Possible) Schedule propagate depth forward from primary inputs depth = 1 + max input depth ALAP (As Late As Possible) Schedule propagate distance from outputs back from outputs level = 1 + max output consumption level Slack slack = L+1-(depth+level) [PI depth=0, PO level=0]

Lect-22.23CprE 583 – Reconfigurable ComputingNovember 6, 2007 Allowable Schedules Active LUTs (N A ) = 3

Lect-22.24CprE 583 – Reconfigurable ComputingNovember 6, 2007 Sequentialization Adding time slots More sequential (more latency) Adds slack Allows better balance L=4  N A =2 (4 or 3 contexts)

Lect-22.25CprE 583 – Reconfigurable ComputingNovember 6, 2007 Multicontext Scheduling “Retiming” for multicontext goal: minimize peak resource requirements NP-complete List schedule, anneal How do we accommodate intermediate data? Effects?

Lect-22.26CprE 583 – Reconfigurable ComputingNovember 6, 2007 Signal Retiming Non-pipelined hold value on LUT Output (wire) from production through consumption Wastes wire and switches by occupying For entire critical path delay L Not just for 1/L’th of cycle takes to cross wire segment How will it show up in multicontext?

Lect-22.27CprE 583 – Reconfigurable ComputingNovember 6, 2007 Signal Retiming Multicontext equivalent Need LUT to hold value for each intermediate context

Lect-22.28CprE 583 – Reconfigurable ComputingNovember 6, 2007 Logically three levels of dependence Single Context: K 2 =18.5M 2 Full ASCII  Hex Circuit

Lect-22.29CprE 583 – Reconfigurable ComputingNovember 6, 2007 Three contexts: K 2 =12.5M 2 Pipelining needed for dependent paths Multicontext Version

Lect-22.30CprE 583 – Reconfigurable ComputingNovember 6, 2007 ASCII  Hex Example All retiming on wires (active outputs) Saturation based on inputs to largest stage With enough contexts only one LUT needed Increased LUT area due to additional stored configuration information Eventually additional interconnect savings taken up by LUT configuration overhead Ideal  Perfect scheduling spread + no retime overhead

Lect-22.31CprE 583 – Reconfigurable ComputingNovember 6, depth=4, c=6: 5.5M 2 (compare 18.5M 2 ) ASCII  Hex Example (cont.)

Lect-22.32CprE 583 – Reconfigurable ComputingNovember 6, 2007 General Throughput Mapping If only want to achieve limited throughput Target produce new result every t cycles Spatially pipeline every t stages cycle = t Retime to minimize register requirements Multicontext evaluation w/in a spatial stage Retime (list schedule) to minimize resource usage Map for depth (i) and contexts (c)

Lect-22.33CprE 583 – Reconfigurable ComputingNovember 6, MCNC circuits Area mapped with SIS and Chortle Benchmark Set

Lect-22.34CprE 583 – Reconfigurable ComputingNovember 6, 2007 Area v. Throughput

Lect-22.35CprE 583 – Reconfigurable ComputingNovember 6, 2007 Area v. Throughput (cont.)

Lect-22.36CprE 583 – Reconfigurable ComputingNovember 6, 2007 Reconfiguration for Fault Tolerance Embedded systems require high reliability in the presence of transient or permanent faults FPGAs contain substantial redundancy Possible to dynamically “configure around” problem areas Numerous on-line and off-line solutions

Lect-22.37CprE 583 – Reconfigurable ComputingNovember 6, 2007 Huang and McCluskey Assume that each FPGA column is equivalent in terms of logic and routing Preserve empty columns for future use Somewhat wasteful Precompile and compress differences in bitstreams Column Based Reconfiguration

Lect-22.38CprE 583 – Reconfigurable ComputingNovember 6, 2007 Create multiple copies of the same design with different unused columns Only requires different inter-block connections Can lead to unreasonable configuration count Column Based Reconfiguration

Lect-22.39CprE 583 – Reconfigurable ComputingNovember 6, 2007 Determining differences and compressing the results leads to “reasonable” overhead Scalability and fault diagnosis are issues Column Based Reconfiguration

Lect-22.40CprE 583 – Reconfigurable ComputingNovember 6, 2007 Summary In many cases cannot profitably reuse logic at device cycle rate Cycles, no data parallelism Low throughput, unstructured Dissimilar data dependent computations These cases benefit from having more than one instructions/operations per active element Economical retiming becomes important here to achieve active LUT reduction For c=[4,8], I=[4,6] automatically mapped designs are 1/2 to 1/3 single context size