Penn ESE680-002 Spring2007 -- DeHon 1 ESE680-002 (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance.

Slides:

Advertisements

Similar presentations

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

Advertisements

Architecture-Specific Packing for Virtex-5 FPGAs

Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.

Xilinx CPLDs and FPGAs Module F2-1. CPLDs and FPGAs XC9500 CPLD XC4000 FPGA Spartan FPGA Spartan II FPGA Virtex FPGA.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.

Lecture 2: Field Programmable Gate Arrays I September 5, 2013 ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 15: March 12, 2007 Interconnect 3: Richness.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.

The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.

HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani,

Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 4: January 22, 2007 Memories.

CS294-6 Reconfigurable Computing Day 2 August 27, 1998 FPGA Introduction.

HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani,

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 10: February 12, 2007 Empirical Comparisons.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance.

CS294-6 Reconfigurable Computing Day 16 October 15, 1998 Retiming.

CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 23: April 9, 2007 Control.

CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.

Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 14: February 28, 2007 Interconnect 2: Wiring Requirements and Implications.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 12: February 21, 2007 Compute 2: Cascades, ALUs, PLAs.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 13, 2008 Retiming.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day12:

Amalgam: a Reconfigurable Processor for Future Fabrication Processes Nicholas P. Carter University of Illinois at Urbana-Champaign.

J. Christiansen, CERN - EP/MIC

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 24: April 18, 2011 Covering and Retiming.

CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 7: February 3, 2002 Retiming.

Timing and Constraints “The software is the lens through which the user views the FPGA.” -Bill Carter.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 5: February 1, 2010 Memories.

ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.

Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.

What is it and why do we need it? Chris Ward CS147 10/16/2008.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 8: February 19, 2014 Memories.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 17: March 29, 2010 Interconnect 2: Wiring Requirements and Implications.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.

Buffering Techniques Greg Stitt ECE Department University of Florida.

1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: February 27, 2005 Retiming 2: Structures and Balance.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 25: April 17, 2013 Covering and Retiming.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming.

ESE532: System-on-a-Chip Architecture

ESE534: Computer Organization

ESE532: System-on-a-Chip Architecture

CS137: Electronic Design Automation

Morgan Kaufmann Publishers Memory & Cache

ESE534: Computer Organization

ESE534: Computer Organization

The Xilinx Virtex Series FPGA

ESE534: Computer Organization

Lecture 20: OOO, Memory Hierarchy

CS184a: Computer Architecture (Structures and Organization)

The Xilinx Virtex Series FPGA

ESE534: Computer Organization

ESE534: Computer Organization

Presentation transcript:

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance

Penn ESE Spring DeHon 2 Last Time Saw how to formulate and automate retiming: –start with network –calculate minimum achievable c c = cycle delay (clock cycle) –make c-slow if want/need to make c=1 –calculate new register placements and move

Penn ESE Spring DeHon 3 Last Time Systematic transformation for retiming –“justify” mandatory registers in design

Penn ESE Spring DeHon 4 Today Retiming in the Large Retiming Requirements Retiming Structures

Penn ESE Spring DeHon 5 Retiming in the Large

Penn ESE Spring DeHon 6 Align Data / Balance Paths Day3: registers to align data

Penn ESE Spring DeHon 7 Serialization –greater serialization  deeper retiming –total: same per compute: larger

Penn ESE Spring DeHon 8 Data Alignment For video (2D) processing –often work on local windows –retime scan lines E.g. –edge detect –smoothing –motion est.

Penn ESE Spring DeHon 9 Image Processing See Data in raster scan order –adjacent, horizontal bits easy –adjacent, vertical bits scan line apart

Penn ESE Spring DeHon 10 Retiming in the Large Aside from the local retiming for cycle optimization (last time) Many intrinsic needs to retime data for correct use of compute engine –some very deep –often arise from serialization

Penn ESE Spring DeHon 11 Reminder: Temporal Interconnect Retiming  Temporal Interconnect Function of data memory –perform retiming

Penn ESE Spring DeHon 12 Requirements not Unique Retiming requirements are not unique to the problem Depends on algorithm/implementation Behavioral transformations can alter significantly

Penn ESE Spring DeHon 13 Requirements Example For I  1 to N –t1[I]  A[I]*B[I] For I  1 to N –t2[I]  C[I]*D[I] For I  1 to N –t3[I]  E[I]*F[I] For I  1 to N –t2[I]  t1[I]+t2[I] For I  1 to N –Q[I]  t2[I]+t3[I] For I  1 to N –t1  A[I]*B[I] –t2  C[I]*D[I] –t1  t1+t2 –t2  E[I]*F[I] –Q[I]  t1+t2 left => 3N regs right => 2 regs Parallelism? Q=A*B+C*D+E*F

Penn ESE Spring DeHon 14 Retiming Requirements

Penn ESE Spring DeHon 15 Flop Experiment #1 Pipeline/C-slow/retime to single LUT delay per cycle –MCNC benchmarks to LUTs –no interconnect accounting –average 1.7 registers/LUT (some circuits 2--7)

Penn ESE Spring DeHon 16 Flop Experiment #2 Pipeline and retime to HSRA cycle –place on HSRA –single LUT or interconnect timing domain –same MCNC benchmarks –average 4.7 registers/LUT

Penn ESE Spring DeHon 17 Value Reuse Profiles What is the distribution of retiming distances needed? –Balance of retiming and compute –Fraction which need various depths –Like wire-length distributions….

Penn ESE Spring DeHon 18 Value Reuse Profiles [Huang&Shen/Micro 1995]

Penn ESE Spring DeHon 19 Example Value Reuse Profile [Huang&Shen/Micro 1995]

Penn ESE Spring DeHon 20 Interpreting VRP Huang and Shen data assume small number of Ops per cycle What happens if exploit more parallelism? –Values reused more frequently –Distances shorten

Penn ESE Spring DeHon 21 Serialization –greater serialization  deeper retiming –total: same per compute: larger Recall

Penn ESE Spring DeHon 22 Idea Task, implemented with a given amount of parallelism –Will have a distribution of retiming requirements –May differ from task to task –May vary independently from compute/interconnect requirements –Another balance issue to watch –May need a canonical way to measure Like Rent?

Penn ESE Spring DeHon 23 Retiming Structure

Penn ESE Spring DeHon 24 Structures How do we implement programmable retiming? Concerns: –Area: 2 /bit –Throughput: bandwidth (bits/time) –Latency important when do not know when we will need data item again

Penn ESE Spring DeHon 25 Just Logic Blocks Most primitive –build flip-flop out of logic blocks I  D*/Clk + I*Clk Q  Q*/Clk + I*Clk –Area: 2 LUTs (800K  1M 2 /LUT each) –Bandwidth: 1b/cycle

Penn ESE Spring DeHon 26 Optional Output Real flip-flop (optionally) on output –flip-flop: 4-5K 2 –Switch to select: ~ 5K 2 –Area: 1 LUT (800K  1M 2 /LUT) –Bandwidth: 1b/cycle

Penn ESE Spring DeHon 27 Separate Flip-Flops Network flip flop w/ own interconnect  can deploy where needed  requires more interconnect  Vary LUT/FF ratio Arch. Parameter Assume routing  inputs 1/4 size of LUT Area: 200K 2 each Bandwidth: 1b/cycle

Penn ESE Spring DeHon 28 Deeper Options Interconnect / Flip-Flop is expensive How do we avoid?

Penn ESE Spring DeHon 29 Deeper Implication –don’t need result on every cycle –number of regs>bits need to see each cycle –  lower bandwidth acceptable  less interconnect

Penn ESE Spring DeHon 30 Deeper Retiming

Penn ESE Spring DeHon 31 Output Single Output –Ok, if don’t need other timings of signal Multiple Output –more routing

Penn ESE Spring DeHon 32 Input More registers (K  ) –7-10K 2 /register –4-LUT => 30-40K 2 /depth No more interconnect than unretimed –open: compare savings to additional reg. cost Area: 1 LUT (1M+d*40K 2 ) get Kd regs d=4, 1.2M 2 Bandwidth: K/cycle 1/d th capacity

Penn ESE Spring DeHon 33 HSRA Input

Penn ESE Spring DeHon 34 Input Retiming

Penn ESE Spring DeHon 35 HSRA Interconnect

Penn ESE Spring DeHon 36 Flop Experiment #2 Pipeline and retime to HSRA cycle –place on HSRA –single LUT or interconnect timing domain –same MCNC benchmarks –average 4.7 registers/LUT Recall

Penn ESE Spring DeHon 37 Input Depth Optimization Real design, fixed input retiming depth –truncate deeper and allocate additional logic blocks

Penn ESE Spring DeHon 38 Extra Blocks (limited input depth) AverageWorst Case Benchmark

Penn ESE Spring DeHon 39 With Chained Dual Output AverageWorst Case Benchmark [can use one BLB as 2 retiming-only chains]

Penn ESE Spring DeHon 40 HSRA Architecture

Penn ESE Spring DeHon 41 Register File From MIPS-X –1K 2 /bit /port –Area(RF) = (d+6)(W+6)(1K 2 +ports* ) w>>6,d>>6 I+o=2 => 2K 2 /bit w=1,d>>6 I=o=4 => 35K 2 /bit –comparable to input chain More efficient for wide-word cases

Penn ESE Spring DeHon 42 Xilinx CLB Xilinx 4K CLB –as memory –works like RF Area: 1/2 CLB (640K 2 )/16  40K 2 /bit –but need 4 CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) –1/16 th capacity

Penn ESE Spring DeHon 43 Virtex SRL16 Xilinx Virtex 4-LUT –Use as 16b shiftreg Area: ~1M 2 /16  60K 2 /bit –Does not need CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) –1/16 th capacity

Penn ESE Spring DeHon 44 Memory Blocks SRAM bit  (large arrays) DRAM bit  (large arrays) Bandwidth: W bits / 2 cycles –usually single read/write –1/2 A th capacity

Penn ESE Spring DeHon 45 Disk Drive Cheaper per bit than DRAM/Flash –(not MOS, no 2 ) Bandwidth: 150MB/s –For 1ns array cycle

Penn ESE Spring DeHon 46 Hierarchy/Structure Summary “Memory Hierarchy” arises from area/bandwidth tradeoffs –Smaller/cheaper to store words/blocks (saves routing and control) –Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect) –High bandwidth out of registers/shallow memories

Penn ESE Spring DeHon 47 Modern FPGAs Output Flop (depth 1) Use LUT as Shift Register (16) Embedded RAMs (16Kb) Interface off-chip DRAM (~0.1—1Gb) No retiming in interconnect –….yet

Penn ESE Spring DeHon 48 Modern Processors DSPs have accumulator (depth 1) Inter-stage pipelines (depth 1) –Lots of pipelining in memory path… Reorder Buffer (4—32) Architected RF (16, 32, 128) Actual RF (256, 512…) L1 Cache (~64Kb) L2 Cache (~1Mb) L3 Cache (10-100Mb) Main Memory in DRAM (~10-100Gb)

Penn ESE Spring DeHon 49 Big Ideas [MSB Ideas] Tasks have a wide variety of retiming distances (depths) Retiming requirements affected by high-level decisions/strategy in solving task Wide variety of retiming costs –100 2  1M 2 Routing and I/O bandwidth –big factors in costs Gives rise to memory (retiming) hierarchy