CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.

Last Time
Retiming transformations to reduce cycle time
With C-slow, can place registers around every compute stage

Today
Retiming “in the large”
Notes on requirements
Structures to support retiming

Filtering
Windowed Average filter
–(similar for echo cancellation)
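
As a minimal sketch of why a windowed average needs retiming, the following Python models the filter with an explicit delay line. The function name `windowed_average` and the use of a `deque` as the delay line are illustrative assumptions, not something taken from the slides; the point is only that a window of W samples implies W retiming registers.

```python
from collections import deque

def windowed_average(samples, window):
    """Yield the mean of the last `window` samples once the delay line is full."""
    delay_line = deque(maxlen=window)  # models `window` retiming registers
    running_sum = 0
    for x in samples:
        if len(delay_line) == window:
            running_sum -= delay_line[0]  # oldest sample is about to be pushed out
        delay_line.append(x)              # with maxlen set, this drops the oldest entry
        running_sum += x
        if len(delay_line) == window:
            yield running_sum / window

averages = list(windowed_average([1, 2, 3, 4, 5, 6], window=3))
```

Each input sample has to be held for `window` cycles before it is subtracted back out, which is the retiming depth the slide refers to.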

Systolic Data Alignment
Similar for bit-skewed arithmetic

Serialization
–greater serialization => deeper retiming
–total: same
–per compute: larger

Data Alignment
For video (2D) processing
–often work on local windows
–retime scan lines

Wavelet
Data stream for horizontal transform
Data stream for vertical transform
–N=image width
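
The scan-line retiming mentioned for 2D windows and for the vertical transform can be sketched as a line buffer. This is an illustration under the assumption of a raster-order pixel stream; `vertical_pairs` and the tiny 3x3 `image` are hypothetical names and data chosen for the example. Pairing a pixel with the one directly above it requires holding N = image width samples.

```python
from collections import deque

def vertical_pairs(pixels, width):
    """Pair each raster-order pixel with the pixel directly above it."""
    line_buffer = deque()  # holds one scan line: `width` retiming slots
    for p in pixels:
        if len(line_buffer) == width:
            above = line_buffer.popleft()  # sample delayed by exactly one scan line
            yield (above, p)
        line_buffer.append(p)

# 3x3 image streamed in raster order
image = [0, 1, 2,
         3, 4, 5,
         6, 7, 8]
pairs = list(vertical_pairs(image, width=3))
```

A horizontal neighbour needs only a one-sample delay, whereas the vertical neighbour needs a delay of a full scan line, which is why the vertical stream is the deep-retiming case.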

Retiming in the Large
Aside from the local retiming for cycle optimization (last time)
Many intrinsic needs to retime data for correct use of compute engine
–some very deep
–often arise from serialization

Reminder: Temporal Interconnect
Retiming = Temporal Interconnect
Function of data memory
–perform retiming

Requirements not Unique
Retiming requirements are not unique to the problem
Depends on algorithm/implementation
Behavioral transformations can alter significantly

Requirements Example
Q=A*B+C*D+E*F

Left (separate loops):
For I ← 1 to N
–t1[I] ← A[I]*B[I]
For I ← 1 to N
–t2[I] ← C[I]*D[I]
For I ← 1 to N
–t3[I] ← E[I]*F[I]
For I ← 1 to N
–t2[I] ← t1[I]+t2[I]
For I ← 1 to N
–Q[I] ← t2[I]+t3[I]

Right (single loop):
For I ← 1 to N
–t1 ← A[I]*B[I]
–t2 ← C[I]*D[I]
–t1 ← t1+t2
–t2 ← E[I]*F[I]
–Q[I] ← t1+t2

left => 3N regs
right => 2 regs
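
The two formulations can be written out as runnable Python to make the storage difference concrete. This is a sketch: the function names and the returned register counts are illustrative bookkeeping rather than measurements, with the counts simply restating the 3N and 2 temporaries of the slide.

```python
def q_separate_loops(A, B, C, D, E, F):
    """Separate-loop version: three N-element temporary arrays are live."""
    n = len(A)
    t1 = [A[i] * B[i] for i in range(n)]
    t2 = [C[i] * D[i] for i in range(n)]
    t3 = [E[i] * F[i] for i in range(n)]
    t2 = [t1[i] + t2[i] for i in range(n)]
    Q = [t2[i] + t3[i] for i in range(n)]
    return Q, 3 * n  # t1, t2, t3 each hold N values

def q_fused_loop(A, B, C, D, E, F):
    """Single-loop version: only two scalar temporaries are live."""
    Q = []
    for i in range(len(A)):
        t1 = A[i] * B[i]
        t2 = C[i] * D[i]
        t1 = t1 + t2
        t2 = E[i] * F[i]
        Q.append(t1 + t2)
    return Q, 2  # t1 and t2

args = ([1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12])
q_left, regs_left = q_separate_loops(*args)
q_right, regs_right = q_fused_loop(*args)
```

Both versions compute the same Q; only the lifetime of the intermediate values, and hence the retiming storage, differs.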

Structures
How do we implement programmable retiming?
Concerns:
–Area: λ²/bit
–Throughput: bandwidth (bits/time)
–Latency: important when we do not know when we will need a data item again

Just Logic Blocks
Most primitive
–build flip-flop out of logic blocks
I ← D*/Clk + I*Clk
Q ← Q*/Clk + I*Clk
–Area: 2 LUTs (800K to 1Mλ²/LUT each)
–Bandwidth: 1b/cycle
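
The two LUT equations behave as a master latch (I) feeding a slave latch (Q). The following is a zero-delay Python idealization of that behavior, a sketch only: `flip_flop_step` and `simulate` are hypothetical helper names, and real combinational feedback loops have timing hazards that this model ignores.

```python
def flip_flop_step(d, clk, i, q):
    """Evaluate both LUT equations once; return the updated (I, Q) pair."""
    new_i = int((d and not clk) or (i and clk))      # I <- D*/Clk + I*Clk
    new_q = int((q and not clk) or (new_i and clk))  # Q <- Q*/Clk + I*Clk
    return new_i, new_q

def simulate(inputs):
    """Apply a sequence of (D, Clk) pairs starting from I = Q = 0."""
    i = q = 0
    outputs = []
    for d, clk in inputs:
        i, q = flip_flop_step(d, clk, i, q)
        outputs.append(q)
    return outputs

# D is captured while Clk is low and appears on Q when Clk goes high;
# changing D while Clk is high has no effect on Q.
q_trace = simulate([(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)])
```

In this trace Q stays at its old value while Clk is low, takes the captured D on the rising edge, and ignores the change of D during the high phase.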

Optional Output
Real flip-flop (optionally) on output
–flip-flop: 4-5Kλ²
–Switch to select: ~5Kλ²
–Area: 1 LUT (800K to 1Mλ²/LUT)
–Bandwidth: 1b/cycle

Output Flip-Flop Needs
Pipeline and C-slow to LUT cycle
Always need an output register
Average Regs/LUT 1.7; some designs need 2-7x

Separate Flip-Flops
Network flip-flop w/ own interconnect
–can deploy where needed
–requires more interconnect
Assume routing goes as inputs
–1/4 size of LUT
–Area: 200Kλ² each
–Bandwidth: 1b/cycle

Deeper Options
Interconnect / Flip-Flop is expensive
How do we avoid?

Deeper Implication
–don’t need result on every cycle
–number of regs < bits need to see each cycle
–=> lower bandwidth acceptable => less interconnect

Deeper Retiming

Output
Single Output
–Ok, if don’t need other timings of signal
Multiple Output
–more routing

Input
More registers (K×)
–7-10Kλ²/register
–4-LUT => 30-40Kλ²/depth
No more interconnect than unretimed
–open: compare savings to additional reg. cost
Area: 1 LUT (1M+d*40Kλ²) get Kd regs
–d=4, 1.2Mλ²
Bandwidth: 1b/cycle
–1/d th capacity
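
The area figure for input retiming can be checked with a few lines of arithmetic. This sketch assumes the upper values quoted on the slide (1Mλ² per LUT, 40Kλ² per unit of depth for a 4-LUT); the constant and function names are illustrative.

```python
LUT_AREA = 1_000_000     # λ² for one 4-LUT with its interconnect (upper slide figure)
AREA_PER_DEPTH = 40_000  # λ² per unit of input depth for a 4-LUT (upper slide figure)
K = 4                    # LUT inputs, each with its own register chain

def input_retiming_area(depth):
    """Area of one LUT with `depth` registers on each of its K inputs."""
    return LUT_AREA + depth * AREA_PER_DEPTH

def register_count(depth):
    """Total registers provided: K chains of `depth` registers."""
    return K * depth

area_d4 = input_retiming_area(4)                # 1.16Mλ², which the slide rounds to 1.2Mλ²
area_per_reg_d4 = area_d4 / register_count(4)   # amortized λ² per register, LUT included
```

With d=4 the block provides 16 registers for roughly 1.2Mλ², which can be compared against the 200Kλ² quoted for each separately routed flip-flop.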

HSRA Input

Input Flip-Flop Requirements
[Figure panels: Before Interconnect Delays; After Interconnect Delays]

Extra Blocks (limited input depth)
[Figure panels: Average; Worst Case Benchmark]

With Chained Dual Output
[Figure panels: Average; Worst Case Benchmark]

Register File
From MIPS-X
–1Kλ²/bit + 500λ²/port
–Area(RF) = (d+6)(W+6)(1Kλ² + ports*500λ²)
w>>6, d>>6, I+o=2 => 2Kλ²/bit
w=1, d>>6, I=o=4 => 35Kλ²/bit
–comparable to input chain
More efficient for wide-word cases
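
The register-file area model can be evaluated numerically to reproduce the two per-bit figures. Note the assumption: the per-port cost of 500λ² used here is inferred from those two figures (it is the value that makes both 2Kλ²/bit and 35Kλ²/bit come out), and the function names are illustrative.

```python
def rf_area(depth, width, ports, cell=1000, per_port=500):
    """Area(RF) = (d+6)(W+6)(cell + ports*per_port), in λ².

    `per_port=500` is an assumed value inferred from the slide's worked figures.
    """
    return (depth + 6) * (width + 6) * (cell + ports * per_port)

def rf_area_per_bit(depth, width, ports):
    """Amortized area per stored bit."""
    return rf_area(depth, width, ports) / (depth * width)

LARGE = 10**6  # stands in for "much greater than 6"

wide_two_port = rf_area_per_bit(LARGE, LARGE, ports=2)   # approaches 2Kλ²/bit
one_bit_eight_port = rf_area_per_bit(LARGE, 1, ports=8)  # approaches 35Kλ²/bit
```

The fixed overhead of 6 in the width term is what makes a 1-bit-wide register file about seven times more expensive per bit than a wide one with the same port count.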

Xilinx CLB
Xilinx 4K CLB
–as memory
–works like RF
Area: 1/2 CLB (640Kλ²)/16 = 40Kλ²/bit
–but need 4 CLBs to control
Bandwidth: 1b/2 cycle (1/2 CLB)
–1/16 th capacity

Memory Blocks
SRAM bit (large arrays)
DRAM bit (large arrays)
Bandwidth: W bits / 2 cycles
–usually single read/write
–1/2^A th capacity

Disk Drive
Cheaper per bit than DRAM/Flash
–(not MOS, no λ²)
Bandwidth: 10-20Mb/s
–For 4ns array cycle: 1b/12.5 cycles
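
The cycles-per-bit figure follows from dividing the time per disk bit by the array cycle time. A short sketch of that conversion, with `cycles_per_bit` as an illustrative helper name:

```python
def cycles_per_bit(bits_per_second, cycle_seconds):
    """Number of array cycles that elapse for each bit the disk delivers."""
    seconds_per_bit = 1.0 / bits_per_second
    return seconds_per_bit / cycle_seconds

fast_disk = cycles_per_bit(20e6, 4e-9)  # 20Mb/s against a 4ns cycle
slow_disk = cycles_per_bit(10e6, 4e-9)  # 10Mb/s against a 4ns cycle
```

At 20Mb/s a bit arrives every 50ns, which is 12.5 array cycles at 4ns; at 10Mb/s it is 25 cycles.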

Hierarchy/Structure Summary
“Memory Hierarchy” arises from area/bandwidth tradeoffs
–Smaller/cheaper to store words/blocks (saves routing and control)
–Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect)
–High bandwidth out of registers/shallow memories

Summary
Tasks have a wide variety of retiming distances
Retiming requirements affected by high-level decisions/strategy in solving task
Wide variety of retiming costs
–100λ² to 1Mλ²
Routing and I/O bandwidth
–big factors in costs
Gives rise to memory (retiming) hierarchy