Resource Sharing in LegUp


Resource Sharing in LegUp

Resource Sharing in High Level Synthesis. Resource Sharing is a well-known technique in HLS to reduce circuit area by sharing functional units. E.g., consider a C program which performs division twice: [diagram: z = a / b and w = c / d, first computed with two dividers, then with a single divider whose inputs are multiplexed between (a, b) and (c, d)]

Resource Sharing in High Level Synthesis. Intuitively, large operators such as dividers, remainder units, and multipliers are beneficial to share. But because multiplexers are relatively expensive to implement in FPGAs, smaller operators (adders, bitwise operations) are generally not shared.
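The trade-off on this slide can be sketched numerically. This is a back-of-the-envelope model with assumed (not measured) LUT counts: sharing pays for the operator once, but adds a 2-to-1 multiplexer on each input bit.

```python
# Back-of-the-envelope area model for sharing one functional unit
# between two operations. All LUT counts are illustrative assumptions.

def shared_area(op_luts, width, mux_luts_per_bit=1):
    """One shared unit: the operator plus 2-to-1 muxes on both inputs."""
    return op_luts + 2 * width * mux_luts_per_bit

def unshared_area(op_luts):
    """Two dedicated units, no multiplexing."""
    return 2 * op_luts

# A 32-bit divider is huge (assume ~1000 LUTs): muxes are cheap by comparison.
print(shared_area(1000, 32), unshared_area(1000))   # 1064 < 2000 -> share
# A 32-bit bitwise AND is tiny (~32 LUTs): the muxes dominate.
print(shared_area(32, 32), unshared_area(32))       # 96 > 64 -> don't share
```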

Example – Sharing a Bitwise AND. Consider a Bitwise AND: a 2-input LUT. And a 2-to-1 MUX: a 3-input LUT.

Example – Sharing a Bitwise AND. Therefore this seems like a bad idea. But in fact, this depends on the LUT architecture.
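How the architecture changes the verdict can be made concrete with a crude per-output-bit model. The one-or-two-LUT decomposition rule below is an assumed simplification, valid here only because no function exceeds 6 inputs:

```python
# Per-output-bit LUT model (assumed simplification): a k-input function
# fits in one LUT when k <= lut_size, otherwise it needs two LUTs.

def bit_luts(k_inputs, lut_size):
    return 1 if k_inputs <= lut_size else 2

for lut_size in (4, 6):
    unshared = 2 * bit_luts(2, lut_size)  # two dedicated 2-input AND bits
    shared = bit_luts(5, lut_size)        # sel ? (a & b) : (c & d): 5 inputs
    print(f"{lut_size}-LUT: unshared={unshared}, shared={shared}")

# 4-input LUTs: 2 vs 2 -> sharing saves no LUTs and adds routing.
# 6-input ALUTs: 2 vs 1 -> the shared bit collapses into a single LUT.
```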

Project Overview and Goals Determine conclusively for which operators sharing is beneficial in FPGAs Consider architectural impact: – 4-input LUT architectures (Cyclone II) – 6-LUT (Adaptive LUT) architectures (Stratix IV) Identify/analyze the benefits of sharing patterns of smaller operations (e.g. multiplication followed by add)

Stratix IV, Adaptive Logic Modules (ALM). Each ALM contains 2 Adaptive LUTs (ALUTs), which can implement a function of between 4 and 7 inputs. [figure comparing the ALM with the Cyclone II logic element]

ALM Example. Consider two circuits: Circuit 1: implemented using 50 6-input LUTs  requires 50 ALMs. Circuit 2: implemented using 45 3-input LUTs and 45 5-input LUTs  requires 45 ALMs, even though the circuit contains more logic.
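The 50 vs. 45 ALM counts can be reproduced with a toy packing model. The pairing rule used here (two LUTs share an ALM when their input counts sum to at most 8, e.g. 3+5 or 4+4) is a simplification of the real ALM constraints:

```python
# Simplified ALM packer: one ALM holds a single large LUT (> 5 inputs),
# or two LUTs whose input counts sum to at most 8 (assumed rule).

def alm_count(lut_sizes):
    small = sorted(s for s in lut_sizes if s <= 5)
    alms = sum(1 for s in lut_sizes if s > 5)   # big LUTs take a whole ALM
    while small:
        s = small.pop()                          # largest remaining small LUT
        if small and s + small[0] <= 8:
            small.pop(0)                         # pack a partner alongside it
        alms += 1
    return alms

circuit1 = [6] * 50                  # 50 six-input LUTs
circuit2 = [3] * 45 + [5] * 45       # 45 three-input + 45 five-input LUTs
print(alm_count(circuit1), alm_count(circuit2))   # 50 45
```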

Resource Sharing in Stratix IV. All of the circuits created by LegUp tend to use mostly 2- and 3-input functions (LUTs). [chart: fraction of small ALUTs per benchmark: 71%, 70%, 78%, 45%, 48%, 57%, 65%, 55%, 53%, 75%; average: 62%]

Sharing Single Operations. Given that LegUp-generated circuits contain mostly 2–3 input functions, the number of ALMs can be reduced by packing many “smaller LUTs” into fewer “larger LUTs”. Revisit the example of the Bitwise AND.

Example – Sharing a Bitwise AND. Consider a 32-bit Bitwise AND: it requires 32 LUTs for 32 output bits. Two unshared ANDs: 64 LUTs (all 2-input LUTs). One shared AND: 32 LUTs (5-input LUTs).

Sharing Single Operations. In the example of bitwise operations, we can reduce the number of LUTs by half at the expense of increasing their size. However, if a circuit contains mostly small LUTs, ALMs are being under-utilized and can incorporate these larger logic functions. Therefore, sharing even small operations reduces ALUT and ALM usage.

Variable Liveness Analysis. Consider next if each bitwise AND had its output stored in a register: unshared, 64 registers; shared, 32 registers (if lifetimes are independent).
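“If lifetimes are independent” is the classic interval test: two values can live in one register only when their def-to-last-use ranges do not overlap. A sketch with assumed cycle intervals:

```python
# Two values may share a register iff their live ranges are disjoint.
# Intervals are (def_cycle, last_use_cycle), an assumed representation.

def lifetimes_overlap(a, b):
    (a_def, a_last), (b_def, b_last) = a, b
    return a_def < b_last and b_def < a_last

assert not lifetimes_overlap((2, 5), (6, 9))  # disjoint: one register suffices
assert lifetimes_overlap((2, 5), (4, 8))      # overlapping: two registers
```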

Evaluating Area of Single Operators. Goal: determine, for each LUT architecture, which single operators produce area reduction when shared. 1. A Verilog module was created for each single LLVM instruction, with multiplexing (“sharable”) and without (“unsharable”). 2. Registers were placed at the inputs and outputs to isolate delay. 3. Area and speed results were obtained for each instruction, in each configuration, for both Cyclone II and Stratix IV.

Evaluating Area of Single Operators. Sharing is beneficial when the ratios (in brackets) are less than 2. More operators show benefit in Stratix IV due to the flexible LUT architecture.
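The decision rule behind the table is simply: share when one multiplexed unit is smaller than two dedicated ones, i.e. when the ratio is below 2. The areas in this sketch are invented placeholders, not the measured results:

```python
# Sharing criterion: area(sharable) / area(unsharable) < 2 means one
# shared unit beats two dedicated ones. Areas below are placeholders.

def worth_sharing(sharable_area, unsharable_area):
    return sharable_area / unsharable_area < 2.0

print(worth_sharing(1100, 1000))  # True: a big divider-like operator
print(worth_sharing(96, 32))      # False: a mux-dominated bitwise AND
```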

Sharing Computational Patterns. So far, ALUTs and registers were saved by sharing single operations. By sharing chains instead of only single operations, the amount of multiplexing is reduced and ALUTs decrease further.

Sharing Computational Patterns. Computational patterns are represented as directed graphs with a single output (“root”) node; each node is an instruction. [diagram: a size-5 pattern graph built from +, –, *, +, and & nodes, with three inputs]

Sharing Computational Patterns. Pattern Sharing Algorithm: 1. Find all computational patterns in the software program. 2. Sort patterns by equivalent functionality. 3. Determine which patterns are candidates for sharing and choose (optimal?) pairing.

1. Finding all Computational Patterns. LLVM produces a Data Flow Graph to represent each compiled C program. The first step of pattern sharing is to find all subgraphs which are candidates for sharing.

1. Finding all Computational Patterns. [animation: a candidate subgraph grows around a const node from size 1 to size 5; only one root node is allowed]
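Step 1 can be sketched as a subgraph-growing search over the data-flow graph. The dict-of-operands DFG below and the `grow` helper are illustrative assumptions, not LegUp's or LLVM's actual data structures:

```python
# Toy DFG: node -> operand nodes. Leaves (empty operand lists) are inputs.
dfg = {
    "add1": ["a", "b"],
    "add2": ["add1", "c"],
    "sub1": ["add2", "d"],
    "a": [], "b": [], "c": [], "d": [],
}

def grow(root, max_size):
    """All connected, single-root subgraphs of op nodes rooted at `root`."""
    results = set()
    def extend(nodes):
        results.add(frozenset(nodes))
        if len(nodes) == max_size:
            return
        for n in nodes:
            for op in dfg[n]:
                if dfg[op] and op not in nodes:   # grow into op nodes only
                    extend(nodes | {op})
    extend({root})
    return results

patterns = grow("sub1", 3)
print(sorted(len(p) for p in patterns))   # [1, 2, 3]
```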

2. Sorting Patterns By Functional Equivalence (as opposed to just topological equivalence). [diagrams: (a) a graph with a re-converging path over inputs A–E; (b) a graph functionally identical to (a) but topologically different due to commutativity]
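Step 2 needs functional rather than purely topological matching; for commutative operators this can be achieved by canonicalizing operand order before comparing graphs. A sketch over nested-tuple expression trees (an assumed toy encoding):

```python
# Canonical form that makes commutative operand order irrelevant, so
# functionally identical but topologically different trees compare equal.

COMMUTATIVE = {"+", "*", "&", "|", "^"}

def canonical(expr):
    if isinstance(expr, str):          # a leaf input variable
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)  # operand order is irrelevant
    return (op, *args)

g1 = ("+", ("&", "A", "B"), "C")       # (A & B) + C
g2 = ("+", "C", ("&", "B", "A"))       # C + (B & A): the same function
assert canonical(g1) == canonical(g2)
```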

3. Decide which Pattern Instances to Share. So far, steps 1 and 2 have provided sets of equivalent patterns. For example, we may have found 4 graphs (A, B, C, D) for one pattern. Our goal is to split these 4 into pairs (create groups of 2), so that each hardware unit will implement two patterns. But which combination of pairs is best?
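With four equivalent instances there are exactly three ways to split them into two pairs, so the choice can be made by scoring each complete pairing. A sketch of the enumeration (the recursive perfect-matching generator is an illustrative helper, not LegUp's code):

```python
# Enumerate every way to split a list of instances into disjoint pairs.

def pairings(items):
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

options = list(pairings(["A", "B", "C", "D"]))
print(options)
# [[('A', 'B'), ('C', 'D')], [('A', 'C'), ('B', 'D')], [('A', 'D'), ('B', 'C')]]
```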

Variable Lifetimes Optimization. Prefer to share patterns with non-overlapping lifetimes – saves registers. [diagrams: (a) values A, B with overlapping lifetimes; (b) values A, B with non-overlapping lifetimes, across patterns P1 and P2]

Shared Input Variable Optimization. Prefer to share patterns with shared input variables – reduces multiplexing cost. [diagram: adders 1 (A + B), 2 (A + C), and 3 (D + E); sharing adders 1 and 2 lets the common input A feed the shared unit without a multiplexer]

Bit Width Optimization. Adder C would be optimized by synthesis tools because only six output bits are needed. Sharing adder C with A or B would force a 6-bit addition to be implemented using a 32-bit adder. [diagram: 32-bit adders A and B, and adder C whose output is ANDed down to 6 bits]

3. Decide which Pattern Instances to Share. Considering these optimizations, a cost function is used to select between possible pairs of graphs. Once pairs have been determined, the Binding phase of LegUp is modified to implement pairs of computational patterns with the same hardware.
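The three optimizations above can be folded into one pairing cost. The weights and the per-instance records below are invented for illustration: overlapping lifetimes add register cost, shared input variables remove mux cost, and a bit-width gap penalizes wasting the wider unit.

```python
# Illustrative pairing cost combining the lifetime, shared-input, and
# bit-width heuristics. Weights and record fields are assumptions.

def pair_cost(p, q, w_life=4, w_mux=1, w_width=2):
    overlap = p["def"] < q["last_use"] and q["def"] < p["last_use"]
    shared_inputs = len(set(p["inputs"]) & set(q["inputs"]))
    width_gap = abs(p["width"] - q["width"])
    return w_life * overlap - w_mux * shared_inputs + w_width * width_gap

a = {"def": 1, "last_use": 4, "inputs": ["x", "y"], "width": 32}
b = {"def": 5, "last_use": 9, "inputs": ["x", "z"], "width": 32}
c = {"def": 2, "last_use": 8, "inputs": ["u", "v"], "width": 6}

# a/b: disjoint lifetimes, one shared input, equal widths -> cheap pair.
# a/c: overlapping lifetimes, no shared inputs, 26-bit gap -> expensive.
assert pair_cost(a, b) < pair_cost(a, c)
```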

Results

Geomean 4.9% improvement for pattern sharing (i.e. between columns 2 and 3)

Geomean 4.2% improvement for pattern sharing (i.e. between columns 2 and 3)

[chart: fraction of ALUTs with 2–3 inputs after pattern sharing, per benchmark: 48%, 57%, 31%, 41%, 55%, 43%, 42%, 40%, 36%, 44%, 64%; average: 45% (was 62%)]

Summary
FPGA logic architecture has a significant impact on resource sharing
Resource sharing can provide >10% area reduction
Future work: alter scheduling to favor creation of certain patterns – provide more sharing opportunities
A paper on this is under review for FPGA 2012 – it contains many details; an advance copy is available