ESE534: Computer Organization. Day 21: April 12, 2010. Retiming. (Penn ESE534, Spring 2010, DeHon)


1 Penn ESE534 Spring2010 -- DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming

2 Today
Retiming Demand
–Folded Computation
–Logical Pipelining
–Physical Pipelining
Retiming Supply
–Technology
–Structures
–Hierarchy

3 Retiming Demand

4 Image Processing
Many operations can be described in terms of 2D window filters
–Compute value as a weighted sum of the neighborhood
–out[x][y] = in[x][y]*c[1][1] + in[x-1][y]*c[0][1] + in[x][y-1]*c[1][0] + in[x+1][y]*c[2][1] + in[x][y+1]*c[1][2] + …
Blurring, edge and feature detection, object recognition and tracking, motion estimation, VLSI design rule checking

5 Preclass 2
Describes window computation:
    for x=0 to N-1
      for y=0 to N-1
        out[x][y]=0;
        for wx=0 to W-1
          for wy=0 to W-1
            out[x][y]+=in[x+wx][y+wy]*c[wx][wy]

6 Preclass 2
How many times is each in[x][y] used?
    for x=0 to N-1
      for y=0 to N-1
        out[x][y]=0;
        for wx=0 to W-1
          for wy=0 to W-1
            out[x][y]+=in[x+wx][y+wy]*c[wx][wy]

7 Preclass 2
Sequentialized on one multiplier,
–Distance between in[x][y] uses?
    for x=0 to N-1
      for y=0 to N-1
        out[x][y]=0;
        for wx=0 to W-1
          for wy=0 to W-1
            out[x][y]+=in[x+wx][y+wy]*c[wx][wy]
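The two Preclass 2 questions (how often each input is read, and how far apart the reads are when everything runs on one multiplier) can be explored with a small simulation. This is a minimal sketch, assuming one multiply per cycle issued in exactly the loop order shown; the function names and the choice N=8, W=3 are illustrative and not from the lecture.

```python
from collections import defaultdict

def input_use_times(N, W):
    """Map each input coordinate to the cycle indices at which it is read,
    assuming one multiply per cycle in the loop order of the slide."""
    uses = defaultdict(list)
    t = 0
    for x in range(N):
        for y in range(N):
            for wx in range(W):
                for wy in range(W):
                    uses[(x + wx, y + wy)].append(t)
                    t += 1
    return uses

def use_gaps(times):
    """Distinct distances (in cycles) between consecutive reads of one input."""
    return sorted({b - a for a, b in zip(times, times[1:])})

N, W = 8, 3                      # illustrative sizes only
uses = input_use_times(N, W)
interior = uses[(4, 4)]          # a pixel far enough from the image edge
print(len(interior))             # number of reads of this input
print(use_gaps(interior))        # distances between successive reads
```

For an interior pixel this reports W*W reads, with a short gap of W*W-1 cycles between neighboring output columns and a long gap of (N-W+1)*W*W-1 cycles between output rows. Those gaps are the retiming distances a folded implementation would have to cover if each input is fetched only once.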

8 Window

9 Parallel Scaling/Spatial Sum
Compute entire window at once
–More hardware, fewer cycles

10 Preclass 2
Unroll inner loop,
–Distance between in[x][y] uses?
    for x=0 to N-1
      for y=0 to N-1
        out[x][y]=0;
        for wx=0 to W-1
          for wy=0 to W-1
            out[x][y]+=in[x+wx][y+wy]*c[wx][wy]

11 Fully Spatial
What if we gave each pixel its own processor?

12 Preclass 1
How many registers on each link?

13 Flop Experiment #1
Pipeline/C-slow/retime to single LUT delay per cycle
–MCNC benchmarks to 256 4-LUTs
–no interconnect accounting
–average 1.7 registers/LUT (some circuits 2-7)

14 Penn ESE680-002 Fall2007 -- DeHon 14 Long Interconnect Path What happens if one of these links ends up on a long interconnect path?

15 Pipeline Interconnect Path
To avoid cycle being limited by longest interconnect
–Pipeline network

16 DeHon Dagstuhl 06361 16 Chips >> Cycles
Chips growing
Gate delays shrinking
Wire delays aren’t scaling down
→ Will take many cycles to cross chip

17 Clock Cycle Radius
Radius of logic can reach in one cycle (45 nm)
–Radius 10 (preclass 20: L_seg=5 → 50ps): few hundred PEs
–Chip side 600-700 PE: 400-500 thousand PEs
–100s of cycles to cross
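The figures on this slide can be checked with a little arithmetic. The sketch below assumes PEs sit on a square grid and that "radius" means Manhattan distance, so the reachable region is a diamond; both are assumptions, since the slide does not state the distance metric. The function names are illustrative.

```python
def pes_within_radius(r):
    """PEs within Manhattan distance r of a center PE on a 2D grid
    (assumed metric): 1 + 4*(1 + 2 + ... + r)."""
    return 2 * r * (r + 1) + 1

def corner_to_corner_cycles(side, r):
    """Cycles to go between opposite corners of a side x side array,
    covering at most r Manhattan hops per cycle (ceiling division)."""
    hops = 2 * (side - 1)
    return -(-hops // r)

radius = 10
print(pes_within_radius(radius))                # PEs reachable in one cycle
for side in (600, 700):
    print(side * side,                          # total PEs on the chip
          corner_to_corner_cycles(side, radius))
```

Under these assumptions the one-cycle region holds a few hundred PEs, the chip holds 360,000 to 490,000 PEs, and a corner-to-corner trip takes over a hundred cycles, which is in line with the slide's numbers.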

18 Pipelined Interconnect
In what cases is this convenient?

19 Pipelined Interconnect
When might it pose a challenge?

20 Long Interconnect Path
What happens here?
(Figure: nodes labeled A, B, C, C)

21 Long Interconnect Path
Adds pipeline delays
May force further pipelining of logic to balance out paths
More registers
(Figure: nodes labeled A, B, C, C)
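The balancing effect can be made concrete with a small calculation. This is a minimal sketch, assuming a feed-forward graph in which every link has a known number of pipeline stages and every node needs all of its inputs on the same cycle. The three-node graph is hypothetical and is not a reconstruction of the figure on the slide; `balance_registers` is an illustrative name.

```python
def balance_registers(edges, order):
    """edges: {(src, dst): pipeline stages on that link}
    order: nodes in topological order.
    Returns the extra registers each link needs so that all inputs of a
    node arrive on the same cycle as its latest input."""
    arrival = {n: 0 for n in order}
    for n in order:
        for (s, d), stages in edges.items():
            if d == n:
                arrival[n] = max(arrival[n], arrival[s] + stages)
    return {(s, d): arrival[d] - (arrival[s] + stages)
            for (s, d), stages in edges.items()}

order = ["A", "B", "C"]                      # hypothetical graph
short = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1}
long_ = {("A", "B"): 1, ("B", "C"): 4, ("A", "C"): 1}   # B->C on a long path

print(balance_registers(short, order))
print(balance_registers(long_, order))
```

With single-stage links, the direct A-to-C link needs one balancing register. Once the B-to-C link is pipelined to four stages, the same A-to-C link needs four, so a long route on one path adds registers on the others.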

22 Flop Experiment #1 (Reminder)
Pipeline/C-slow/retime to single LUT delay per cycle
–MCNC benchmarks to 256 4-LUTs
–no interconnect accounting
–average 1.7 registers/LUT (some circuits 2-7)

23 Flop Experiment #2
Pipeline and retime to HSRA cycle
–place on HSRA
–single LUT or interconnect timing domain
–same MCNC benchmarks
–average 4.7 registers/LUT
[Tsu et al., FPGA 1999]

24 Retiming Requirements
Retiming requirement depends on parallelism and performance
Even with a given amount of parallelism
–Will have a distribution of retiming requirements
–May differ from task to task
–May vary independently from compute/interconnect requirements
Another balance issue to watch
–Balance with compute, interconnect
Need a canonical way to measure
–Like Rent?

25 Retiming Supply

26 Optional Output
Flip-flop (optionally) on output
–flip-flop: 4-5Kλ²
–Switch to select: ~5Kλ²
–Area: 1 LUT (800K→1Mλ²/LUT)
–Bandwidth: 1b/cycle

27 Output
Single Output
–OK if other timings of the signal are not needed
Multiple Output
–more routing

28 Input
More registers (K×)
–7-10Kλ²/register+mux
–4-LUT => 30-40Kλ²/depth
No more interconnect than unretimed
–open: compare savings to additional reg. cost
Area: 1 LUT (1M + d*40Kλ²) gets Kd regs
–d=4: 1.2Mλ²
Bandwidth: K/cycle
–1/d-th capacity
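The area line on this slide is a simple linear model, sketched below using the slide's own constants (1Mλ² for the LUT plus interconnect, 40Kλ² per unit of input depth for a 4-LUT). The function names are illustrative, and taking the upper 40Kλ² figure rather than 30Kλ² is an assumption that matches the formula as written.

```python
LUT_AREA = 1_000_000        # λ², LUT + interconnect (slide figure)
AREA_PER_DEPTH = 40_000     # λ² per unit of depth for a 4-LUT (upper slide figure)

def input_retimed_area(d):
    """Area in λ² of one LUT with depth-d retiming chains on its inputs."""
    return LUT_AREA + d * AREA_PER_DEPTH

def retiming_registers(k, d):
    """Number of retiming registers provided: K inputs, each d deep."""
    return k * d

k, d = 4, 4
area = input_retimed_area(d)
regs = retiming_registers(k, d)
print(area, regs)                       # total λ² and register count
print((area - LUT_AREA) / regs)         # added λ² per retiming register
```

For d=4 this gives 1.16Mλ², which the slide rounds to 1.2Mλ², and 16 registers at about 10Kλ² of added area each, the top of the 7-10Kλ² register+mux range quoted above.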

29 Preclass 3

30 Some Numbers (memory)
Unit of area = λ² (F=2λ)
–[more next week]
Register as stand-alone element ≈ 4Kλ²
–e.g. as needed/used last lecture
Static RAM cell ≈ 1Kλ²
–SRAM Memory (single ported)
Dynamic RAM cell (DRAM process) ≈ 100λ²
Dynamic RAM cell (SRAM process) ≈ 300λ²
(from Day 5)

31 Retiming Density
LUT+interconnect ≈ 1Mλ²
Register as stand-alone element ≈ 4Kλ²
Static RAM cell ≈ 1Kλ²
–SRAM Memory (single ported)
Dynamic RAM cell (DRAM process) ≈ 100λ²
Dynamic RAM cell (SRAM process) ≈ 300λ²
Can have much more retiming memory per chip if put it in large arrays
–…but then cannot get to it as frequently
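To get a feel for the density gap, the per-bit areas above can be converted into "bits that fit in the area of one LUT plus its interconnect". This is a rough sketch that uses only the slide's approximate figures and ignores array overhead such as decoders and sense amplifiers; the dictionary keys are illustrative labels.

```python
LUT_AREA = 1_000_000   # λ², LUT + interconnect (slide figure)

# Approximate area per stored bit in λ², as listed on the slide.
CELL_AREA = {
    "stand-alone register": 4_000,
    "SRAM cell": 1_000,
    "DRAM cell (SRAM process)": 300,
    "DRAM cell (DRAM process)": 100,
}

def bits_per_lut_area(cell_area):
    """How many cells of the given area fit in one LUT's worth of silicon."""
    return LUT_AREA // cell_area

for name, area in CELL_AREA.items():
    print(f"{name}: {bits_per_lut_area(area)} bits per LUT area")
```

The spread is about 40x between a stand-alone register and a DRAM cell, which is the motivation for moving deep retiming into large arrays despite the lower access bandwidth.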

32 Retiming Structure Concerns
Area: λ²/bit
Throughput: bandwidth (bits/time)
Energy

33 Just Logic Blocks
Most primitive
–build flip-flop out of logic blocks
    I ← D*/Clk + I*Clk
    Q ← Q*/Clk + I*Clk
–Area: 2 LUTs (800K→1Mλ²/LUT each)
–Bandwidth: 1b/cycle

34 Separate Flip-Flops
Network flip flop w/ own interconnect
(+) can deploy where needed
(–) requires more interconnect
(+) Vary LUT/FF ratio
–Arch. Parameter
Assume routing ∝ inputs
–1/4 size of LUT
Area: 200Kλ² each
Bandwidth: 1b/cycle

35 Virtex SRL16
Xilinx Virtex 4-LUT
–Use as 16b shiftreg
Area: ~1Mλ²/16 ≈ 60Kλ²/bit
–Does not need CLBs to control
Bandwidth: 1b/2 cycle (1/2 CLB)
–1/16th capacity

36 Register File Memory Bank
From MIPS-X
–1Kλ²/bit + 500λ²/port
–Area(RF) = (d+6)(w+6)(1Kλ² + ports*500λ²)
w>>6, d>>6, i+o=2 => 2Kλ²/bit
w=1, d>>6, i=o=4 => 35Kλ²/bit
–comparable to input chain
More efficient for wide-word cases
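The two per-bit figures follow directly from the area formula. The sketch below evaluates it, reading "i=o=4" as four input ports plus four output ports (eight in total); that reading is an assumption, though it does reproduce the 35Kλ²/bit on the slide. `rf_area_per_bit` is an illustrative name.

```python
def rf_area_per_bit(d, w, ports):
    """Register-file area per stored bit in λ², using the slide's model:
    Area(RF) = (d+6)(w+6)(1Kλ² + ports*500λ²), divided by d*w bits."""
    total = (d + 6) * (w + 6) * (1_000 + ports * 500)
    return total / (d * w)

deep = 1_000_000   # stands in for "d >> 6" (and "w >> 6")

print(round(rf_area_per_bit(deep, deep, ports=2)))   # wide, deep, 1 in + 1 out
print(round(rf_area_per_bit(deep, 1, ports=8)))      # 1 bit wide, 4 in + 4 out
```

In the one-bit-wide case the fixed overhead of 6 is charged against a single column, so the cell cost is multiplied by 7. This is why the register file only pays off for wide words.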

37 Xilinx CLB
Xilinx 4K CLB
–as memory
–works like RF
Area: 1/2 CLB (640Kλ²)/16 ≈ 40Kλ²/bit
–but need 4 CLBs to control
Bandwidth: 1b/2 cycle (1/2 CLB)
–1/16th capacity

38 Memory Blocks
SRAM bit ≈ 1200λ² (large arrays)
DRAM bit ≈ 100λ² (large arrays)
Bandwidth: W bits / 2 cycles
–usually single read/write
–1/2^A-th capacity

39 Dual-Ported Block RAMs
Virtex-6 Series
–36Kb memories
Stratix-4
–640b, 9Kb, 144Kb
Can put 1M/1K ≈ 1K bits in space of 4-LUT
–Trade few 4-LUTs for considerable memory

40 Hierarchy/Structure Summary
“Memory Hierarchy” arises from area/bandwidth tradeoffs
–Smaller/cheaper to store words/blocks (saves routing and control)
–Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect)
–High bandwidth out of shallow memories
–Applications have mix of retiming needs

41 Modern FPGAs
Output Flop (depth 1)
Use LUT as Shift Register (16)
Embedded RAMs (9Kb, 36Kb)
Interface off-chip DRAM (~0.1-1Gb)
No retiming in interconnect
–…yet

42 Modern Processors
DSPs have accumulator (depth 1)
Inter-stage pipelines (depth 1)
–Lots of pipelining in memory path…
Reorder Buffer (4-32)
Architected RF (16, 32, 128)
Actual RF (256, 512…)
L1 Cache (~64Kb)
L2 Cache (~1Mb)
L3 Cache (10-100Mb)
Main Memory in DRAM (~10-100Gb)

43 Admin
Final Exercise
–Basic statement out
–Will continue to refine and detail
–One arch. desc. to add by next Monday
No new homework grading
Reading for Wednesday on web

44 Big Ideas [MSB Ideas]
Tasks have a wide variety of retiming distances (depths)
–Within design, among tasks
Retiming requirements vary independently of compute, interconnect requirements (balance)
Wide variety of retiming costs
–100λ² → 1Mλ²
Routing and I/O bandwidth
–big factors in costs
Gives rise to memory (retiming) hierarchy

