CprE / ComS 583 Reconfigurable Computing


CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA Technology Mapping

CprE 583 – Reconfigurable Computing, August 31, 2006

Quick Points
- Lectures are viewable for on-campus students via WebCT
- Class e-mail list created: cpre583@iastate.edu
- Less focus on interconnect theory
- More on interconnects in actual devices
- Read [AggLew94], [ChaWon96A], [Deh96A] for more details

Recap
- Various FPGA programming technologies (anti-fuse, (E)EPROM, Flash, SRAM): SRAM is the most popular

LUTs and Digital Logic
- k inputs give $2^k$ possible input values
- A k-LUT corresponds to a $2^k \times 1$-bit memory in which the truth table is stored
- $2^{2^k}$ possible functions, of which $O(2^{2^k} / k!)$ are unique (up to permutation of the inputs)
- Example: $F = A_0 A_1 A_2 + \bar{A}_0 A_1 \bar{A}_2 + \bar{A}_0 \bar{A}_1 \bar{A}_2$
- [Figure: truth table of a LUT with inputs A0, A1, A2]
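To make the "truth table stored in a small memory" view concrete, here is a minimal Python sketch. The helper names `make_lut` and `lut_read` are illustrative only, and the choice of A0 as the most significant address bit is an assumption, not something the slide specifies.

```python
from itertools import product

def make_lut(k, func):
    """Build a k-LUT: a list of 2**k bits, one per input combination."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def lut_read(table, inputs):
    """Read the LUT: the inputs form the memory address (first input = MSB)."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]

def f(a0, a1, a2):
    """F = A0 A1 A2 + ~A0 A1 ~A2 + ~A0 ~A1 ~A2, the example function from the slide."""
    n0, n1, n2 = 1 - a0, 1 - a1, 1 - a2
    return (a0 & a1 & a2) | (n0 & a1 & n2) | (n0 & n1 & n2)

table = make_lut(3, f)            # 2**3 = 8 stored bits
num_functions = 2 ** (2 ** 3)     # 256 distinct 3-input functions
```

Programming a different function only changes the stored bits; the lookup logic stays the same, which is what makes the LUT a general-purpose logic element.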

Outline
- Recap
- General Routing Architectures
- FPGA Architectural Issues
- Early Commercial FPGAs
  - Xilinx XC3000
  - Xilinx XC4000
- Technology Mapping using LUTs

General Routing Architecture
- A wire segment is a wire unbroken by programmable switches
- A track is a sequence of one or more wire segments in a line
- A routing channel is a group of parallel tracks
- A connection block provides connectivity from the inputs and outputs of a logic block to the wire segments in the channels
- A switch block provides connectivity between the horizontal and vertical wire segments on all four of its sides

Switch Boxes
- Fs: the number of connections offered per incoming wire
- A universal switchbox can connect any set of inputs to their target output channels simultaneously
- A universal switchbox can be built with Fs = 3
- The Xilinx XC4000 switchbox has Fs = 3 but is not universal
- Read [ChaWon96A] for more details
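As a rough illustration of what Fs = 3 means, the sketch below models a "disjoint" switch box, in which track i on one side can only reach track i on the other three sides. Treating this pattern as representative of an Fs = 3, non-universal switchbox is an assumption made for illustration; the function name and side labels are hypothetical.

```python
SIDES = ("north", "east", "south", "west")

def disjoint_switch_box(width):
    """Map each (side, track) wire to the wires it can be switched onto."""
    return {
        (side, track): [(other, track) for other in SIDES if other != side]
        for side in SIDES
        for track in range(width)
    }

box = disjoint_switch_box(4)

# Every incoming wire is offered exactly three connections, so Fs = 3.
fs_values = {len(targets) for targets in box.values()}

# A signal entering on track t can never leave on a different track number.
stays_on_track = all(
    target_track == track
    for (_, track), targets in box.items()
    for (_, target_track) in targets
)
```

Because a signal is locked to one track number in this pattern, some sets of connections that a universal switchbox could route simultaneously cannot be routed here, even though both offer three connections per wire.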

Architectural Issues [AhmRos04A]
- What values of N (logic blocks per cluster), I (inputs per cluster), and K (LUT inputs) minimize the following parameters?
  - Area
  - Delay
  - Area-delay product
- Assumptions
  - All routing wires have length 4
  - Fully populated IMUX
  - Wiring is half pass transistor, half tri-state

Number of Inputs per Cluster
- Large clusters offer many opportunities for input sharing [BetRos97A]
- Reducing the number of inputs reduces the size of the device and makes it faster
- Most FPGA devices (Xilinx) have 4 BLEs per cluster, with more inputs than are actually needed

Logic Cluster Size
- Small clusters are more efficient (this includes the area needed for routing)
- The smallest clusters (e.g. one BLE per cluster) are not "CAD friendly"
- Most commercial devices have 4-8 BLEs per cluster

Effect of N and K on Area
- A cluster size of N = 6 to 8 with K = 4 to 5 gives good area results

Effect of N and K on Performance
- Inconclusive: large K and N > 3 look good

Effect of N and K on Area-Delay
- K = 4 to 6 and N = 4 to 10 look reasonable

Putting it All Together
- Area:
  - LUT count decreases with k (slower than exponential)
  - LUT size increases with k (exponential logic area, roughly linear interconnect area)
- Delay:
  - LUT depth decreases with k (logarithmic)
  - LUT delay increases with k (linear)
- Examples:
  - Xilinx XC3000 family: Fs = 3, I = 5, N = 2
  - Xilinx XC4000 family: I = 9, N ~ 2.5

XC3000 Logic Block
- One 5-LUT, or two 4-LUTs

XC4000 Logic Block

XC4000 Routing Structure

XC4000 Routing Structure (cont.)

LUT Computational Limits
- A k-LUT can implement $2^{2^k}$ functions
- Given n such k-LUTs, can implement $(2^{2^k})^n$ functions
- Since 4-LUTs are efficient, want to find n such that $(2^{2^4})^n \ge 2^{2^M}$
- Example: implementing a 7-LUT with 4-LUTs
- [Figure: 4-LUTs fed by inputs A0-A3, with A4, A5, and A6 as the remaining inputs]
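One way to build the 7-LUT example is sketched below, under the assumption that the construction is a Shannon-style split: eight leaf 4-LUTs each store a 16-entry slice of the truth table, and a tree of 2:1 multiplexers (each small enough to fit in one 4-LUT) selects among them using the remaining inputs. The function names and the address convention are illustrative.

```python
import random

def decompose(truth_table, m, k=4):
    """Split a 2**m-entry truth table into 2**(m-k) leaf k-LUT tables.

    Address convention (an assumption): the low k address bits index into
    a leaf, and the high m-k bits choose which leaf is selected.
    """
    leaf_size = 1 << k
    return [truth_table[i:i + leaf_size] for i in range(0, 1 << m, leaf_size)]

def evaluate(leaves, address, k=4):
    """Evaluate through leaf lookups and a 2:1 mux tree; returns (bit, luts_used)."""
    low = address & ((1 << k) - 1)
    select = address >> k
    level = [leaf[low] for leaf in leaves]   # one k-LUT per leaf
    luts = len(level)
    while len(level) > 1:
        sel_bit = select & 1                 # one select input per mux level
        select >>= 1
        level = [level[i + sel_bit] for i in range(0, len(level), 2)]
        luts += len(level)                   # one LUT per 2:1 mux
    return level[0], luts

random.seed(0)
M = 7
table = [random.randint(0, 1) for _ in range(1 << M)]
leaves = decompose(table, M)
results = [evaluate(leaves, addr) for addr in range(1 << M)]
```

For M = 7 this uses 8 leaf LUTs plus 7 mux LUTs, 15 in total. More compact constructions may exist; this one is chosen only because it is easy to check.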

LUT Computational Limits (cont.)
- How much computation can be performed in a table lookup?
- Upper bound (from previous): $n \le 2^{M-3}$
- Need n 4-LUTs to cover an M-LUT:
  - $(2^{2^4})^n \ge 2^{2^M}$
  - $n \log(2^{2^4}) \ge \log(2^{2^M})$
  - $n \, 2^4 \log(2) \ge 2^M \log(2)$
  - $n \, 2^4 \ge 2^M$
  - $n \ge 2^{M-4}$
- Adding the upper bound: $2^{M-4} \le n \le 2^{M-3}$
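The counting argument can be checked numerically with Python's arbitrary-precision integers. This is a small sketch with hypothetical function names; the generalisation from 4-LUTs to other k is an assumption and only makes sense for k >= 3 and m >= k.

```python
def lut_count_bounds(m, k=4):
    """Return (lower, upper) bounds on the number of k-LUTs needed for an m-input LUT."""
    lower = 1 << (m - k)        # counting bound: n * 2**k >= 2**m
    upper = 1 << (m - k + 1)    # constructive bound, 2**(M-3) when k = 4
    return lower, upper

def counting_bound_is_tight(m, k=4):
    """Check that `lower` LUTs offer enough configurations and `lower - 1` do not."""
    lower, _ = lut_count_bounds(m, k)
    target = 2 ** (2 ** m)                          # number of m-input functions
    enough = (2 ** (2 ** k)) ** lower >= target
    too_few = (2 ** (2 ** k)) ** (lower - 1) < target
    return enough and too_few

print(lut_count_bounds(7))   # (8, 16)
```

The lower bound only counts configuration bits, so it says nothing about whether a given interconnection of that many LUTs can actually realise every function.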

LUTs Versus Memories
- Can also implement $(2^{2^k})^w$ as a single large memory with k inputs and w outputs
- Large memory advantage: no need for interconnect, and only one input decoder is required
- Consider a 32K x 8-bit memory (170M λ², 21 ns latency)
  - w = 8, k = 16 (or two 8-bit inputs to address $2^{16}$ locations)
  - Can implement an 8-bit addition or subtraction
- Xilinx XC3042: 288 4-LUTs (180M λ², 13 ns CLB delay)
  - 15-bit parity calculation: 5 4-LUTs (<2% of the XC3042, 3.125M λ²), versus 170M λ² for the entire SRAM
  - 7-bit addition: 14 4-LUTs (<5% of the XC3042, 8.75M λ²)
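The parity figure can be reproduced with a short sketch: four first-level 4-LUTs each XOR a group of inputs, and a fifth 4-LUT XORs their outputs. The grouping of the 15 inputs and the helper names are assumptions made for illustration; the area arithmetic simply divides the 180M λ² quoted for the XC3042 evenly across its 288 4-LUTs.

```python
# A 4-LUT programmed as a 4-input XOR: entry i holds the parity of address i.
XOR4 = [bin(addr).count("1") & 1 for addr in range(16)]

def lut4(table, bits):
    """Read a 4-LUT; unused inputs are tied to 0."""
    padded = list(bits) + [0] * (4 - len(bits))
    addr = sum(bit << i for i, bit in enumerate(padded))
    return table[addr]

def parity15(bits):
    """15-bit parity from five 4-LUTs: four first-level LUTs feed one second-level LUT."""
    assert len(bits) == 15
    groups = [bits[0:4], bits[4:8], bits[8:12], bits[12:15]]
    partial = [lut4(XOR4, group) for group in groups]
    return lut4(XOR4, partial)

AREA_XC3042 = 180e6                          # lambda^2, as quoted on the slide
LUTS_XC3042 = 288
area_per_lut = AREA_XC3042 / LUTS_XC3042     # 0.625M lambda^2 per 4-LUT
parity_area = 5 * area_per_lut               # 3.125M lambda^2
adder_area = 14 * area_per_lut               # 8.75M lambda^2
```

The per-LUT figure includes a proportional share of the device's interconnect, which is why the comparison against the 170M λ² SRAM is reasonably fair.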

LUT Technology Mapping
- Task: map a netlist to LUTs, minimizing area and/or delay
- Similar to technology mapping for traditional designs
- Library approach not feasible: $O(2^{2^k} / k!)$ elements in the library
- In general it is NP-hard
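To get a feel for the library size, the brute-force sketch below counts k-input functions that remain distinct after permuting the inputs. Interpreting "unique" as "distinct up to input permutation" is an assumption suggested by the k! term; the function names are illustrative. The count is always at least $2^{2^k} / k!$, since each equivalence class contains at most k! functions.

```python
from itertools import permutations, product

def permute_inputs(table, perm, k):
    """Truth table obtained by reordering the k inputs according to perm."""
    out = []
    for bits in product((0, 1), repeat=k):
        src = tuple(bits[perm[i]] for i in range(k))
        addr = 0
        for b in src:
            addr = (addr << 1) | b
        out.append(table[addr])
    return tuple(out)

def count_classes(k):
    """Count k-input functions that are distinct up to input permutation."""
    seen = set()
    classes = 0
    for table in product((0, 1), repeat=1 << k):
        if table in seen:
            continue
        classes += 1
        # Mark every function reachable by reordering inputs as already covered.
        for perm in permutations(range(k)):
            seen.add(permute_inputs(table, perm, k))
    return classes

print(count_classes(2))   # 12
print(count_classes(3))   # 80
```

Enumeration is already slow at k = 4 and hopeless beyond that, which mirrors why a precomputed cell library is impractical for LUT mapping.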

Area vs. Delay Mapping
Delay and area optimization are competing goals: we often obtain one at the expense of the other. In the slide below, we see two possible mappings for LUTs where K = 4. The mapping on the left has a delay of 2 LUT delays from inputs to output, while the mapping on the right has 3 LUT delays through the network. However, the mapping on the right uses only 3 LUTs, while the mapping on the left uses 4 LUTs.

Decomposition
As noted earlier, the decomposition of the network into gates can also affect the mapping. In the slide below, the topmost network is mapped naively into 4 LUTs (again K = 4). The decompositions at the bottom show how different (smaller) mappings can be found, depending on how the OR gate is decomposed; the decomposition on the right does not allow a mapping into 2 LUTs.

Why Replicate?
Node replication in the network can also simplify the mapping: the network on the left cannot be mapped into fewer than three 4-input LUTs, whereas the network on the right fits in 2 LUTs.

Reconvergence
We must also take advantage of reconvergence in the network. In the figure below, the mapping on the left does not look far enough back (towards the inputs) to find the superior mapping shown on the right.

Dynamic Programming
Dynamic programming is used in both Chortle and FlowMap. The target network is traversed from inputs to outputs (topological order); for any node, the optimal (local) solution can be found by combining the solutions at nodes which appear above the node being considered (i.e. closer to the inputs). Since these nodes have already been evaluated, this task is greatly simplified.

In the example here, the mapping goal is to minimize delay. We iterate through the nodes in order from the inputs. At each node, we can consider a LUT which implements that node; the delay for such an implementation is then one plus the maximum delay over all nodes which feed the LUT under consideration. The minimum delay for the node is thus the minimum delay over all LUT implementations of that node.

In the diagram on the left, each loop represents the minimum-depth LUT which implements each node, with the delay of the LUT written inside the loop. The figure on the right shows the corresponding mapping for area; the value of a node in this case is thus the sum of the values of the nodes which fan in to the LUT, plus 1.
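The delay recurrence described above can be sketched as follows. This is not the Chortle or FlowMap algorithm itself: it swaps in exhaustive enumeration of K-feasible cuts as a simple stand-in for how candidate LUTs are found, and it assumes every gate has at most K fanins. The network, node names, and function name are hypothetical.

```python
from itertools import product

def min_depth_labels(fanins, order, k=4):
    """Label each node with its minimum LUT depth.

    fanins: node -> list of fanin nodes (empty list for primary inputs).
    order:  all nodes in topological order, inputs first.
    """
    cuts = {}    # node -> set of frozensets; each frozenset is the leaf set of one cut
    depth = {}
    for node in order:
        ins = fanins[node]
        if not ins:
            cuts[node] = {frozenset([node])}
            depth[node] = 0
            continue
        node_cuts = set()
        # A candidate LUT for this node merges one already-computed cut per fanin.
        for combo in product(*(cuts[i] for i in ins)):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:             # the LUT must have at most k inputs
                node_cuts.add(leaves)
        # Delay of a LUT = 1 + worst delay among the nodes feeding it.
        depth[node] = 1 + min(max(depth[leaf] for leaf in leaves)
                              for leaves in node_cuts)
        node_cuts.add(frozenset([node]))     # lets fanouts use this node's output directly
        cuts[node] = node_cuts
    return depth

fanins = {"a": [], "b": [], "c": [], "d": [], "e": [],
          "g1": ["a", "b"], "g2": ["c", "d"],
          "g3": ["g1", "g2"], "g4": ["g3", "e"]}
order = ["a", "b", "c", "d", "e", "g1", "g2", "g3", "g4"]
depth = min_depth_labels(fanins, order, k=4)
```

In this small network, `g3` depends on only four primary inputs and so fits in a single 4-LUT, while `g4` depends on five and needs a second level. The number of cuts can grow quickly on large networks, so this enumeration is only practical for small examples.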

Summary
- FPGA design issues involve the number of logic blocks per cluster, the number of inputs per logic block, the routing architecture, and the k-LUT size
- Can build an M-LUT with n 4-LUTs, where $2^{M-4} \le n \le 2^{M-3}$
- Large LUTs are generally inefficient
- Technology mapping is simplified because of 4-LUT properties
- Techniques: decomposition, replication, reconvergence, dynamic programming
- Area- or delay-optimal mapping is still NP-hard