ESE534: Computer Organization

Slides:



Advertisements
Similar presentations
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
Advertisements

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 15: March 12, 2007 Interconnect 3: Richness.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 13: February 26, 2007 Interconnect 1: Requirements.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
CS294-6 Reconfigurable Computing Day 14 October 7/8, 1998 Computing with Lookup Tables.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 18: March 21, 2007 Interconnect 6: MoT.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 12: February 21, 2007 Compute 2: Cascades, ALUs, PLAs.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 4, 2005 Interconnect 1: Requirements.
ESE Spring DeHon 1 ESE534: Computer Organization Day 19: April 7, 2014 Interconnect 5: Meshes.
CBSSS 2002: DeHon Costs André DeHon Wednesday, June 19, 2002.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day7: October 16, 2000 Instruction Space (computing landscape)
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 6, 2003 Interconnect 3: Richness.
An Improved “Soft” eFPGA Design and Implementation Strategy
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 31, 2003 Compute 2:
Penn ESE534 Spring DeHon 1 ESE534 Computer Organization Day 9: February 13, 2012 Interconnect Introduction.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
CBSSS 2002: DeHon Universal Programming Organization André DeHon Tuesday, June 18, 2002.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 13, 2013 High Level Synthesis II Dataflow Graph Sharing.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
ESE Spring DeHon 1 ESE534: Computer Organization Day 20: April 9, 2014 Interconnect 6: Direct Drive, MoT.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 28, 2005 Empirical Comparisons.
ESE532: System-on-a-Chip Architecture
ESE534: Computer Organization
ESE534: Computer Organization
ESE532: System-on-a-Chip Architecture
ESE534: Computer Organization
Topics SRAM-based FPGA fabrics: Xilinx. Altera..
ESE534: Computer Organization
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
XILINX FPGAs Xilinx lunched first commercial FPGA XC2000 in 1985
ESE534 Computer Organization
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
CSE 370 – Winter 2002 – Comb. Logic building blocks - 1
ESE534: Computer Organization
Topics Circuit design for FPGAs: Logic elements. Interconnect.
ESE534: Computer Organization
ESE534: Computer Organization
Day 15: October 13, 2010 Performance: Gates
CS184a: Computer Architecture (Structures and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
CprE / ComS 583 Reconfigurable Computing
ESE532: System-on-a-Chip Architecture
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

ESE534: Computer Organization Day 14: March 17, 2010 Compute 1: LUTs Penn ESE534 Spring2010 -- DeHon

Previously Instruction Space Modeling Empirical Comparisons huge range of densities huge range of efficiencies large architecture space modeling to understand design space Empirical Comparisons Ground cost of programmability Penn ESE534 Spring2010 -- DeHon

Today Look at Programmable Compute Blocks Specifically LUTs Recurring theme: define parameterized space identify costs and benefits look at typical application requirements compose results, try to find best point Penn ESE534 Spring2010 -- DeHon

Compute Function What do we use for “compute” function Any Universal NANDx ALU LUT Penn ESE534 Spring2010 -- DeHon

Lookup Table Load bits into table Table translation 2N bits to describe  22N different functions Table translation performs logic transform Penn ESE534 Spring2010 -- DeHon

Lookup Table Penn ESE534 Spring2010 -- DeHon

We could... Just build a large memory = large LUT Put our function in there What’s wrong with that? Penn ESE534 Spring2010 -- DeHon

How bit is an k-LUT? k-input, 1-output? k-input, m-output? Penn ESE534 Spring2010 -- DeHon

FPGA = Many small LUTs Alternative to one big LUT Penn ESE534 Spring2010 -- DeHon

Toronto FPGA Model Penn ESE534 Spring2010 -- DeHon

What’s best to use? Small LUTs Large Memories …small LUTs or large LUTs Continuum question: how big should our memory blocks used to perform computation be? Penn ESE534 Spring2010 -- DeHon

Start to Sort Out: Big vs. Small Luts Establish equivalence how many small LUTs equal one big LUT? Penn ESE534 Spring2010 -- DeHon

“gates” in 2-LUT ? Penn ESE534 Spring2010 -- DeHon

How Much Logic in a LUT? Lower Bound? Not use all inputs? Concrete: 4-LUTs to implement M-LUT? Not use all inputs? 0 … maybe 1 Use all inputs? (M-1)/3 example M-input AND cover 4 ins w/ first 4-LUT, 3 more and cascade input with each additional (M-1)/(k-1) for K-lut Penn ESE534 Spring2010 -- DeHon

How much logic in a LUT? Upper Upper Bound: M-LUT implemented w/ 4-LUTs M-LUT  2M-4+(2M-4-1)  2M-3 4-LUTs Penn ESE534 Spring2010 -- DeHon

How Much? Lower Upper Bound: (224)n  22M nlog(224) log(22M) 22M functions realizable by M-LUT Say Need n 4-LUTs to cover; compute n: strategy count functions realizable by each (224)n  22M nlog(224) log(22M) n24log(2)  2Mlog(2) n24  2M n  2M-4 Penn ESE534 Spring2010 -- DeHon

How Much? 2M-4 n 2M-3 Combine Lower Upper Bound Upper Lower Bound (number of 4-LUTs in M-LUT) 2M-4 n 2M-3 Penn ESE534 Spring2010 -- DeHon

Memories and 4-LUTs For the most complex functions SRAM 32Kx8 l=0.6mm an M-LUT has ~2M-4 4-LUTs SRAM 32Kx8 l=0.6mm 170Ml2 (21ns latency) 8*211 =16K 4-LUTs XC3042 l=0.6mm 180Ml2 (13ns delay per CLB) 288 4-LUTs Memory is 50+x denser than FPGA …and faster Penn ESE534 Spring2010 -- DeHon

Memory and 4-LUTs For “regular” functions? 15-bit parity entire 32Kx8 SRAM 5 4-LUTs (2% of XC3042 ~ 3.2Ml2~1/50th Memory) Penn ESE534 Spring2010 -- DeHon

16-bit Adder from Memory and 3-LUTs How many inputs? outputs? Area for single large LUT? How many 3-LUTs? Area per 3-LUT? Area to implement adder with 3-LUTs? Ratio? Penn ESE534 Spring2010 -- DeHon

Memory and 4-LUTs Same 32Kx8 SRAM 7b Add entire 32Kx8 SRAM (largest will support) 14 4-LUTs (5% of XC3042, 8.8Ml2~1/20th Memory) Penn ESE534 Spring2010 -- DeHon

LUT + Interconnect Interconnect allows us to exploit structure in computation Consider addition: N-input add takes 2N 3-LUTs one N-output (2N)-LUT N×2(2N) >> 2N×23 N=16: 16×232 >> 2×16×23 236 >> 28  factor of 228 =256 Million Penn ESE534 Spring2010 -- DeHon

LUT + Interconnect Interconnect allows us to exploit structure in computation Even if Interconnect was 99% of the area (100× logic area) Would still be worth paying! Add: N×2(2N) >> 2N×(23×128) N=16: 16×236 >> 2×16×210=215  factor of 221 =2 Million Structure exploitation to avoid exponential costs is worth it! Penn ESE534 Spring2010 -- DeHon

Different Instance of a Familiar Concept The most general functions are huge Applications exhibit structure Typical functions not so complex Exploit structure to optimize “common” case Penn ESE534 Spring2010 -- DeHon

LUT Count vs. base LUT size [Rose et al., JSSC 25(5):1217—1225, 1990] Penn ESE534 Spring2010 -- DeHon

LUT Count vs. base LUT size Simple: (M-1)/(K-1) Complex: 2(M-K) Penn ESE534 Spring2010 -- DeHon

LUT vs. K DES MCNC Benchmark moderately irregular Simple: (M-1)/(K-1) Complex: 2(M-K) Penn ESE534 Spring2010 -- DeHon

Toronto Experiments Want to determine best K for LUTs Bigger LUTs handle complicated functions efficiently less interconnect overhead Smaller LUTs handle regular functions efficiently interconnect allows exploitation of compute structure What’s the typical complexity/structure? [Rose et al., JSSC 25(5):1217—1225, 1990] Penn ESE534 Spring2010 -- DeHon

Standard Systematization Define a design/optimization space pick key parameters e.g. K = number of LUT inputs Build a cost model Map designs Look at resource costs at each point Compose: Logical ResourcesResource Cost Look for best design points Penn ESE534 Spring2010 -- DeHon

Toronto LUT Size Map to K-LUT Route to determine wiring tracks use Chortle Route to determine wiring tracks global route different channel width W for each benchmark Area Model for K and W Alut exponential in K Interconnect area based on switch count Penn ESE534 Spring2010 -- DeHon

LUT Area vs. K Routing Area roughly linear in K ? Penn ESE534 Spring2010 -- DeHon

LUT Area vs. K Interconnect ~ 20x logic Penn ESE534 Spring2010 -- DeHon

Mapped LUT Area Compose Mapped LUTs and Area Model Penn ESE534 Spring2010 -- DeHon

Mapped LUT Area Penn ESE680-002 Spring2007 -- DeHon

Mapped Area vs. LUT K N.B. unusual case minimum area at K=3 Penn ESE534 Spring2010 -- DeHon

Toronto Result Minimum LUT Area at K=4 Important to note minimum on previous slides based on particular cost model robust for different switch sizes (wire widths) [see graphs in paper] Penn ESE534 Spring2010 -- DeHon

Implications Custom? / Gate Arrays? More restricted logic functions? Penn ESE534 Spring2010 -- DeHon

Delay Penn ESE534 Spring2010 -- DeHon

Delay? Circuit Depth in LUTs? Lower bound? “Simple Function”  M-input AND 1 table lookup in M-LUT logk(M) lookups in K-LUT Penn ESE534 Spring2010 -- DeHon

Delay? M-input “Complex” function logk(2(M-k))=(M-k)logk(2) 1 table lookup for M-LUT Lower Upper bound: logk(2(M-k)) +1 logk(2(M-k))=(M-k)logk(2) Penn ESE534 Spring2010 -- DeHon

Some Math (M-k)logk(2) (M-k)/log2(k) Y=logk(2) kY = 2 Ylog2(k) = 1 logk(2)=1/log2(k) (M-k)logk(2) (M-k)/log2(k) Penn ESE534 Spring2010 -- DeHon

Delay? M-input “Complex” function logk(2(M-k))=(M-k)logk(2) Lower Upper bound: logk(2(M-k)) +1 logk(2(M-k))=(M-k)logk(2) Lower Upper Bound: (M-k)/log2(k) +1 Penn ESE534 Spring2010 -- DeHon

Delay? M-input “Complex” function Upper Bound: use each k-lut as a k- log2(k) input mux Upper Bound: (M-k)/log2(k- log2(k))+1 Penn ESE534 Spring2010 -- DeHon

Delay? M-input “Complex” function 1 table lookup for M-LUT between: (M-k)/log2(k) +1 and (M-k)/log2(k- log2(k))+1 Penn ESE534 Spring2010 -- DeHon

Delay Both scale as 1/log(k) Simple: log M Complex: linear in M Penn ESE534 Spring2010 -- DeHon

Circuit Depth vs. K [Rose et al., JSSC 27(3):281—287, 1992] Penn ESE534 Spring2010 -- DeHon

LUT Delay vs. K tLUTc0+c1K For small LUTs: Large LUTs: add length term c2 2K Plus Wire Delay ~area Penn ESE534 Spring2010 -- DeHon

Delay vs. K Why not satisfied with this model? Delay = Depth  (tLUT+ tInterconnect) Penn ESE534 Spring2010 -- DeHon

Delay vs. K (proper critical path interconnect) [Luu et al., FPGA 2009] Penn ESE534 Spring2010 -- DeHon

Observation General interconnect is expensive “Larger” logic blocks fewer interconnect crossings reduces interconnect delay get larger less area efficient don’t match structure in computation get slower Happens faster than modeled here due to area Penn ESE534 Spring2010 -- DeHon

Admin Reading Next Wednesday will have guest lecture Today’s: classic paper…definitely read Wed.  no required reading Next Wednesday will have guest lecture Relevant to final project Penn ESE534 Spring2010 -- DeHon

Big Ideas [MSB Ideas] Memory most dense programmable structure for the most complex functions Memory inefficient (scales poorly) for structured compute tasks Most tasks have structure Programmable interconnect allows us to exploit that structure Penn ESE534 Spring2010 -- DeHon

Big Ideas [MSB-1 Ideas] Area LUT count decrease w/ K, but slower than exponential LUT size increase w/ K exponential LUT function empirically linear routing area Minimum area around K=4 Penn ESE534 Spring2010 -- DeHon

Big Ideas [MSB-1 Ideas] Delay LUT depth decreases with K in practice closer to log(K) Delay increases with K small K linear + large fixed term Penn ESE534 Spring2010 -- DeHon