CS184a: Computer Architecture (Structure and Organization)

Slides:

Advertisements

Similar presentations

Digital Design Copyright © 2006 Frank Vahid 1 FPGA Internals: Lookup Tables (LUTs) Basic idea: Memory can implement combinational logic –e.g., 2-address.

Advertisements

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 15: March 12, 2007 Interconnect 3: Richness.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day4: October 4, 2000 Memories, ALUs, and Virtualization.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.

CS294-6 Reconfigurable Computing Day 2 August 27, 1998 FPGA Introduction.

CS294-6 Reconfigurable Computing Day 14 October 7/8, 1998 Computing with Lookup Tables.

Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 12: February 21, 2007 Compute 2: Cascades, ALUs, PLAs.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 4, 2005 Interconnect 1: Requirements.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 15: February 12, 2003 Interconnect 5: Meshes.

CBSSS 2002: DeHon Costs André DeHon Wednesday, June 19, 2002.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 18, 2005 Interconnect 6: MoT.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 16: February 14, 2003 Interconnect 6: MoT.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day7: October 16, 2000 Instruction Space (computing landscape)

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 6, 2003 Interconnect 3: Richness.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 31, 2003 Compute 2:

Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.

Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.

CBSSS 2002: DeHon Universal Programming Organization André DeHon Tuesday, June 18, 2002.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 12: February 5, 2003 Interconnect 2: Wiring Requirements.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 28, 2005 Empirical Comparisons.

ESE532: System-on-a-Chip Architecture

Field Programmable Gate Arrays

ESE534: Computer Organization

ESE534: Computer Organization

ESE532: System-on-a-Chip Architecture

ESE534: Computer Organization

Topics SRAM-based FPGA fabrics: Xilinx. Altera..

CS184a: Computer Architecture (Structure and Organization)

XILINX FPGAs Xilinx lunched first commercial FPGA XC2000 in 1985

ESE534 Computer Organization

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

CSE 370 – Winter 2002 – Comb. Logic building blocks - 1

ESE534: Computer Organization

Topics Circuit design for FPGAs: Logic elements. Interconnect.

Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.

ESE534: Computer Organization

CS184a: Computer Architecture (Structures and Organization)

CS184a: Computer Architecture (Structures and Organization)

CS184a: Computer Architecture (Structure and Organization)

ESE534: Computer Organization

ESE534: Computer Organization

Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.

ESE534: Computer Organization

CS184a: Computer Architecture (Structure and Organization)

CprE / ComS 583 Reconfigurable Computing

ESE532: System-on-a-Chip Architecture

Reconfigurable Computing (EN2911X, Fall07)

Presentation transcript:

CS184a: Computer Architecture (Structure and Organization) Day 9: January 29, 2003 Compute 1: LUTs Caltech CS184 Winter2003 -- DeHon

Previously Instruction Space Modeling Empirical Comparisons huge range of densities huge range of efficiencies large architecture space modeling to understand design space Empirical Comparisons Ground cost of programmability Caltech CS184 Winter2003 -- DeHon

Today Look at Programmable Compute Blocks Specifically LUTs Today Recurring theme: define parameterized space identify costs and benefits look at typical application requirements compose results, try to find best point Caltech CS184 Winter2003 -- DeHon

Compute Function What do we use for “compute” function Any Universal NANDx ALU LUT Caltech CS184 Winter2003 -- DeHon

Lookup Table Load bits into table Table translation 2N bits to describe  22N different functions Table translation performs logic transform Caltech CS184 Winter2003 -- DeHon

Lookup Table Caltech CS184 Winter2003 -- DeHon

We could... Just build a large memory = large LUT Put our function in there What’s wrong with that? Caltech CS184 Winter2003 -- DeHon

FPGA = Many small LUTs Alternative to one big LUT Caltech CS184 Winter2003 -- DeHon

Toronto FPGA Model Caltech CS184 Winter2003 -- DeHon

What’s best to use? Small LUTs Large Memories …small LUTs or large LUTs …or, how big should our memory blocks used to peform computation be? Caltech CS184 Winter2003 -- DeHon

Start to Sort Out: Big vs. Small Luts Establish equivalence how many small LUTs equal one big LUT? Caltech CS184 Winter2003 -- DeHon

“gates” in 2-LUT ? Caltech CS184 Winter2003 -- DeHon

How Much Logic in a LUT? Lower Bound? Not use all inputs? Concrete: 4-LUTs to implement M-LUT? Not use all inputs? 0 … maybe 1 Use all inputs? (M-1)/3 example M-input AND cover 4 ins w/ first 4-LUT, 3 more and cascade input with each additional (M-1)/k for K-lut Caltech CS184 Winter2003 -- DeHon

How much logic in a LUT? Upper Upper Bound: M-LUT implemented w/ 4-LUTs M-LUT  2M-4+(2M-4-1)  2M-3 4-LUTs Caltech CS184 Winter2003 -- DeHon

How Much? Lower Upper Bound: (224)n  22M nlog(224) log(22M) 22M functions realizable by M-LUT Say Need n 4-LUTs to cover; compute n: strategy count functions realizable by each (224)n  22M nlog(224) log(22M) n24log(2)  2Mlog(2) n24  2M n  2M-4 Caltech CS184 Winter2003 -- DeHon

How Much? 2M-4  n 2M-3 Combine Lower Upper Bound Upper Lower Bound (number of 4-LUTs in M-LUT) 2M-4  n 2M-3 Caltech CS184 Winter2003 -- DeHon

Memories and 4-LUTs For the most complex functions an M-LUT has ~2M-4 4-LUTs SRAM 32Kx8 l=0.6mm 170Ml2 (21ns latency) 8*211 =16K 4-LUTs XC3042 l=0.6mm 180Ml2 (13ns delay per CLB) 288 4-LUTs Memory is 50+x denser than FPGA …and faster Caltech CS184 Winter2003 -- DeHon

Memory and 4-LUTs For “regular” functions? 15-bit parity 7b Add entire 32Kx8 SRAM 5 4-LUTs (2% of XC3042 ~ 3.2Ml2~1/50th Memory) 7b Add 14 4-LUTs (5% of XC3042, 8.8Ml2~1/20th Memory) Caltech CS184 Winter2003 -- DeHon

LUT + Interconnect Interconnect allows us to exploit structure in computation Already know LUT Area << Interconnect Area Area of an M-LUT on FPGA >> M-LUT Area …but most M-input functions complexity << 2M Caltech CS184 Winter2003 -- DeHon

Different Instance, Same Concept Most general functions are huge Applications exhibit structure Exploit structure to optimize “common” case Caltech CS184 Winter2003 -- DeHon

LUT Count vs. base LUT size Caltech CS184 Winter2003 -- DeHon

LUT vs. K DES MCNC Benchmark moderately irregular Caltech CS184 Winter2003 -- DeHon

Toronto Experiments Want to determine best K for LUTs Bigger LUTs handle complicated functions efficiently less interconnect overhead Smaller LUTs handle regular functions efficiently interconnect allows exploitation of compute sturcture What’s the typical complexity/structure? Caltech CS184 Winter2003 -- DeHon

Familiar Systematization Define a design/optimization space pick key parameters e.g. K = number of LUT inputs Build a cost model Map designs Look at resource costs at each point Compose: Logical ResourcesResource Cost Look for best design points Caltech CS184 Winter2003 -- DeHon

Toronto LUT Size Map to K-LUT Route to determine wiring tracks use Chortle Route to determine wiring tracks global route different channel width W for each benchmark Area Model for K and W Alut exponential in K Interconnect area based on switch count. Caltech CS184 Winter2003 -- DeHon

LUT Area vs. K Routing Area roughly linear in K ? Caltech CS184 Winter2003 -- DeHon

Mapped LUT Area Compose Mapped LUTs and Area Model Caltech CS184 Winter2003 -- DeHon

Mapped Area vs. LUT K N.B. unusual case minimum area at K=3 Caltech CS184 Winter2003 -- DeHon

Toronto Result Minimum LUT Area at K=4 Important to note minimum on previous slides based on particular cost model robust for different switch sizes (wire widths) [see graphs in paper] Caltech CS184 Winter2003 -- DeHon

Implications Caltech CS184 Winter2003 -- DeHon

Implications Custom? / Gate Arrays? More restricted logic functions? Caltech CS184 Winter2003 -- DeHon

Relate to Sequential? How does this result relate to sequential execution case? Number of LUTs = Number of Cycles Interconnect Cost? Naïve structure in practice? Instruction Cost? Caltech CS184 Winter2003 -- DeHon

Delay Back to Spatial Caltech CS184 Winter2003 -- DeHon

Delay? Circuit Depth in LUTs? “Simple Function”  M-input AND 1 table lookup in M-LUT logk(M) lookups in K-LUT Caltech CS184 Winter2003 -- DeHon

Delay? M-input “Complex” function logk(2(M-k))=(M-k)logk(2) 1 table lookup for M-LUT Lower bound: logk(2(M-k)) +1 logk(2(M-k))=(M-k)logk(2) Caltech CS184 Winter2003 -- DeHon

Some Math (M-k)logk(2) (M-k)/log2(k) Y=logk(2) kY = 2 Ylog2(k) = 1 logk(2)=1/log2(k) (M-k)logk(2) (M-k)/log2(k) Caltech CS184 Winter2003 -- DeHon

Delay? M-input “Complex” function logk(2(M-k))=(M-k)logk(2) Lower bound: logk(2(M-k)) +1 logk(2(M-k))=(M-k)logk(2) Lower Bound: (M-k)/log2(k) +1 Caltech CS184 Winter2003 -- DeHon

Delay? M-input “Complex” function Upper Bound: use each k-lut as a k- log2(k) input mux Upper Bound: (M-k)/log2(k- log2(k))+1 Caltech CS184 Winter2003 -- DeHon

Delay? M-input “Complex” function 1 table lookup for M-LUT between: (M-k)/log2(k) +1 and (M-k)/log2(k- log2(k))+1 Caltech CS184 Winter2003 -- DeHon

Delay Both scale as 1/log(k) Simple: log M Complex: linear in M Caltech CS184 Winter2003 -- DeHon

Circuit Depth vs. K Caltech CS184 Winter2003 -- DeHon

LUT Delay vs. K tLUTc0+c1K For small LUTs: Large LUTs: add length term c2 2K Plus Wire Delay ~area Caltech CS184 Winter2003 -- DeHon

Delay vs. K Why not satisfied with this model? Delay = Depth  (tLUT+ tInterconnect) Caltech CS184 Winter2003 -- DeHon

Observation General interconnect is expensive “Larger” logic blocks less interconnect crossing lower interconnect delay get larger get slower Happens faster than modeled here due to area less area efficient don’t match structure in computation Caltech CS184 Winter2003 -- DeHon

Big Ideas [MSB Ideas] Memory most dense programmable structure for the most complex functions Memory inefficient (scales poorly) for structured compute tasks Most tasks have some structure Programmable interconnect allows us to exploit that structure Caltech CS184 Winter2003 -- DeHon

Big Ideas [MSB-1 Ideas] Area LUT count decrease w/ K, but slower than exponential LUT size increase w/ K exponential LUT function empirically linear routing area Minimum area around K=4 Caltech CS184 Winter2003 -- DeHon

Big Ideas [MSB-1 Ideas] Delay LUT depth decreases with K in practice closer to log(K) Delay increases with K small K linear + large fixed term minimum around 5-6 Caltech CS184 Winter2003 -- DeHon