Caltech CS184 Winter2003 -- DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance.

Slides:



Advertisements
Similar presentations
Lecture 19: Cache Basics Today’s topics: Out-of-order execution
Advertisements

Architecture-Specific Packing for Virtex-5 FPGAs
Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.
Xilinx CPLDs and FPGAs Module F2-1. CPLDs and FPGAs XC9500 CPLD XC4000 FPGA Spartan FPGA Spartan II FPGA Virtex FPGA.
System Design Tricks for Low-Power Video Processing Jonah Probell, Director of Multimedia Solutions, ARC International.
EECE579: Digital Design Flows
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
Lecture 2: Field Programmable Gate Arrays I September 5, 2013 ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.
HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani,
Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 4: January 22, 2007 Memories.
CS294-6 Reconfigurable Computing Day 2 August 27, 1998 FPGA Introduction.
HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani,
CS294-6 Reconfigurable Computing Day 16 October 15, 1998 Retiming.
CS294-6 Reconfigurable Computing Day 14 October 7/8, 1998 Computing with Lookup Tables.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
Revision Mid 2 Prof. Sin-Min Lee Department of Computer Science.
CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Hao Ji.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day12:
Amalgam: a Reconfigurable Processor for Future Fabrication Processes Nicholas P. Carter University of Illinois at Urbana-Champaign.
J. Christiansen, CERN - EP/MIC
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 7: February 3, 2002 Retiming.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 5: February 1, 2010 Memories.
CPS3340 COMPUTER ARCHITECTURE Fall Semester, /3/2013 Lecture 9: Memory Unit Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE CENTRAL.
ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.
CALTECH CS137 Spring DeHon 1 CS137: Electronic Design Automation Day 5: April 12, 2004 Covering and Retiming.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 6, 2003 Interconnect 3: Richness.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 31, 2003 Compute 2:
Computer Organization CS224 Fall 2012 Lessons 39 & 40.
07/11/2005 Register File Design and Memory Design Presentation E CSE : Introduction to Computer Architecture Slides by Gojko Babić.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
What is it and why do we need it? Chris Ward CS147 10/16/2008.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 12: February 5, 2003 Interconnect 2: Wiring Requirements.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 8: February 19, 2014 Memories.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 28, 2005 Empirical Comparisons.
Buffering Techniques Greg Stitt ECE Department University of Florida.
1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: February 27, 2005 Retiming 2: Structures and Balance.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 25: April 17, 2013 Covering and Retiming.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming.
ESE532: System-on-a-Chip Architecture
Memories.
ESE534: Computer Organization
ESE532: System-on-a-Chip Architecture
CS137: Electronic Design Automation
CS184a: Computer Architecture (Structure and Organization)
Morgan Kaufmann Publishers Memory & Cache
ECE 445 – Computer Organization
ESE534: Computer Organization
The Xilinx Virtex Series FPGA
ESE534: Computer Organization
CS184a: Computer Architecture (Structures and Organization)
The Xilinx Virtex Series FPGA
Presentation transcript:

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance

Caltech CS184 Winter DeHon 2 Last Time Saw how to formulate and automate retiming: –start with network –calculate minimum achievable c c = cycle delay (clock cycle) –make c-slow if want/need to make c=1 –calculate new register placements and move

Caltech CS184 Winter DeHon 3 Last Time Systematic transformation for retiming –“justify” mandatory registers in design

Caltech CS184 Winter DeHon 4 Today Retiming in the Large Retiming Requirements Retiming Structures

Caltech CS184 Winter DeHon 5 Retiming in the Large

Caltech CS184 Winter DeHon 6 Align Data / Balance Paths Day3: registers to align data

Caltech CS184 Winter DeHon 7 Systolic Data Alignment Bit-level max

Caltech CS184 Winter DeHon 8 Serialization –greater serialization  deeper retiming –total: same per compute: larger

Caltech CS184 Winter DeHon 9 Data Alignment For video (2D) processing –often work on local windows –retime scan lines E.g. –edge detect –smoothing –motion est.

Caltech CS184 Winter DeHon 10 Image Processing See Data in raster scan order –adjacent, horizontal bits easy –adjacent, vertical bits scan line apart

Caltech CS184 Winter DeHon 11 Wavelet Data stream for horizontal transform Data stream for vertical transform –N=image width

Caltech CS184 Winter DeHon 12 Retiming in the Large Aside from the local retiming for cycle optimization (last time) Many intrinsic needs to retime data for correct use of compute engine –some very deep –often arise from serialization

Caltech CS184 Winter DeHon 13 Reminder: Temporal Interconnect Retiming  Temporal Interconnect Function of data memory –perform retiming

Caltech CS184 Winter DeHon 14 Requirements not Unique Retiming requirements are not unique to the problem Depends on algorithm/implementation Behavioral transformations can alter significantly

Caltech CS184 Winter DeHon 15 Requirements Example For I  1 to N –t1[I]  A[I]*B[I] For I  1 to N –t2[I]  C[I]*D[I] For I  1 to N –t3[I]  E[I]*F[I] For I  1 to N –t2[I]  t1[I]+t2[I] For I  1 to N –Q[I]  t2[I]+t3[I] For I  1 to N –t1  A[I]*B[I] –t2  C[I]*D[I] –t1  t1+t2 –t2  E[I]*F[I] –Q[I]  t1+t2 left => 3N regs right => 2 regs Q=A*B+C*D+E*F

Caltech CS184 Winter DeHon 16 Retiming Structure and Requirements

Caltech CS184 Winter DeHon 17 Structures How do we implement programmable retiming? Concerns: –Area: 2 /bit –Throughput: bandwidth (bits/time) –Latency important when do not know when we will need data item again

Caltech CS184 Winter DeHon 18 Just Logic Blocks Most primitive –build flip-flop out of logic blocks I  D*/Clk + I*Clk Q  Q*/Clk + I*Clk –Area: 2 LUTs (800K  1M 2 /LUT each) –Bandwidth: 1b/cycle

Caltech CS184 Winter DeHon 19 Optional Output Real flip-flop (optionally) on output –flip-flop: 4-5K 2 –Switch to select: ~ 5K 2 –Area: 1 LUT (800K  1M 2 /LUT) –Bandwidth: 1b/cycle

Caltech CS184 Winter DeHon 20 Output Flip-Flop Needs Pipeline and C- slow to LUT cycle Always need an output register Average Regs/LUT 1.7, some designs need 2--7x

Caltech CS184 Winter DeHon 21 Separate Flip-Flops Network flip flop w/ own interconnect  can deploy where needed  requires more interconnect Assume routing  inputs 1/4 size of LUT Area: 200K 2 each Bandwidth: 1b/cycle

Caltech CS184 Winter DeHon 22 Deeper Options Interconnect / Flip-Flop is expensive How do we avoid?

Caltech CS184 Winter DeHon 23 Deeper Implication –don’t need result on every cycle –number of regs>bits need to see each cycle –  lower bandwidth acceptable  less interconnect

Caltech CS184 Winter DeHon 24 Deeper Retiming

Caltech CS184 Winter DeHon 25 Output Single Output –Ok, if don’t need other timings of signal Multiple Output –more routing

Caltech CS184 Winter DeHon 26 Input More registers (K  ) –7-10K 2 /register –4-LUT => 30-40K 2 /depth No more interconnect than unretimed –open: compare savings to additional reg. cost Area: 1 LUT (1M+d*40K 2 ) get Kd regs d=4, 1.2M 2 Bandwidth: K/cycle 1/d th capacity

Caltech CS184 Winter DeHon 27 HSRA Input

Caltech CS184 Winter DeHon 28 Input Retiming

Caltech CS184 Winter DeHon 29 HSRA Interconnect

Caltech CS184 Winter DeHon 30 Flop Experiment #1 Pipeline and retime to single LUT delay per cycle –MCNC benchmarks to LUTs –no interconnect accounting –average 1.7 registers/LUT (some circuits 2--7)

Caltech CS184 Winter DeHon 31 Flop Experiment #2 Pipeline and retime to HSRA cycle –place on HSRA –single LUT or interconnect timing domain –same MCNC benchmarks –average 4.7 registers/LUT

Caltech CS184 Winter DeHon 32 Input Depth Optimization Real design, fixed input retiming depth –truncate deeper and allocate additional logic blocks

Caltech CS184 Winter DeHon 33 Extra Blocks (limited input depth) AverageWorst Case Benchmark

Caltech CS184 Winter DeHon 34 With Chained Dual Output AverageWorst Case Benchmark [can use one BLB as 2 retiming-only chains]

Caltech CS184 Winter DeHon 35 HSRA Architecture

Caltech CS184 Winter DeHon 36 Register File From MIPS-X –1K 2 /bit /port –Area(RF) = (d+6)(W+6)(1K 2 +ports* ) w>>6,d>>6 I+o=2 => 2K 2 /bit w=1,d>>6 I=o=4 => 35K 2 /bit –comparable to input chain More efficient for wide-word cases

Caltech CS184 Winter DeHon 37 Xilinx CLB Xilinx 4K CLB –as memory –works like RF Area: 1/2 CLB (640K 2 )/16  40K 2 /bit –but need 4 CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) –1/16 th capacity

Caltech CS184 Winter DeHon 38 Memory Blocks SRAM bit  (large arrays) DRAM bit  (large arrays) Bandwidth: W bits / 2 cycles –usually single read/write –1/2 A th capacity

Caltech CS184 Winter DeHon 39 Disk Drive Cheaper per bit than DRAM/Flash –(not MOS, no 2 ) Bandwidth: 10-20Mb/s –For 4ns array cycle 1b/12.5

Caltech CS184 Winter DeHon 40 Hierarchy/Structure Summary “Memory Hierarchy” arises from area/bandwidth tradeoffs –Smaller/cheaper to store words/blocks (saves routing and control) –Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect) –High bandwidth out of registers/shallow memories

Caltech CS184 Winter DeHon 41 Modern FPGAs Output Flop (depth 1) Use LUT as Shift Register (16) Embedded RAMs (16Kb) Interface off-chip DRAM (~0.1—1Gb) No retiming in interconnect –….yet

Caltech CS184 Winter DeHon 42 Modern Processors DSPs have accumulator (depth 1) Inter-stage pipelines (depth 1) –Lots of pipelining in memory path… Reorder Buffer (4—32) Architected RF (16, 32, 128) Actual RF (256, 512…) L1 Cache (~64Kb) L2 Cache (~1Mb) L3 Cache (10-100Mb) Main Memory in DRAM (~10-100Gb)

Caltech CS184 Winter DeHon 43 Big Ideas [MSB Ideas] Tasks have a wide variety of retiming distances (depths) Retiming requirements affected by high-level decisions/strategy in solving task Wide variety of retiming costs –100 2  1M 2 Routing and I/O bandwidth –big factors in costs Gives rise to memory (retiming) hierarchy