Download presentation
Presentation is loading. Please wait.
Published byEverett Patrick Modified over 8 years ago
1
Penn ESE534 Spring2010 -- DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming
2
Penn ESE534 Spring2010 -- DeHon 2 Today Retiming Demand –Folded Computation –Logical Pipelining –Physical Pipelining Retiming Supply –Technology –Structures –Hierarchy
3
Penn ESE534 Spring2010 -- DeHon 3 Retiming Demand
4
Image Processing Many operations can be described in terms of 2D window filters –Compute value as a weighted sum of the neighborhood –Out[x][y]=in[x][y]*c[1][1]+in[x- 1][y]c[0][1]+in[x][y- 1]*c[1][0]+in[x+1][y]*c[2][1]+in[x][y+1]*c[1][2 ]…. Blurring, edge and feature detection, object recognition and tracking, motion estimation, VLSI design rule checking Penn ESE534 Spring2010 -- DeHon 4
5
Preclass 2 Describes window computation: 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2010 -- DeHon 5
6
Preclass 2 How many times is each in[x][y] used? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2010 -- DeHon 6
7
Preclass 2 Sequentialized on one multiplier, –Distance between in[x][y] uses? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2010 -- DeHon 7
8
Window Penn ESE534 Spring2010 -- DeHon 8
9
9 Parallel Scaling/Spatial Sum Compute entire window at once –More hardware, fewer cycles
10
Preclass 2 Unroll inner loop, –Distance between in[x][y] uses? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2010 -- DeHon 10
11
Fully Spatial What if we gave each pixel its own processor? Penn ESE534 Spring2010 -- DeHon 11
12
Preclass 1 How many registers on each link? Penn ESE534 Spring2010 -- DeHon 12
13
Penn ESE534 Spring2010 -- DeHon 13 Flop Experiment #1 Pipeline/C-slow/retime to single LUT delay per cycle –MCNC benchmarks to 256 4-LUTs –no interconnect accounting –average 1.7 registers/LUT (some circuits 2--7)
14
Penn ESE680-002 Fall2007 -- DeHon 14 Long Interconnect Path What happens if one of these links ends up on a long interconnect path?
15
Penn ESE680-002 Fall2007 -- DeHon 15 Pipeline Interconnect Path To avoid cycle being limited by longest interconnect –Pipeline network
16
DeHon Dagstuhl 06361 16 Chips >> Cycles Chips growing Gate delays shrinking Wire delays aren’t scaling down Will take many cycles to cross chip
17
DeHon Dagstuhl 06361 17 Clock Cycle Radius Radius of logic can reach in one cycle (45 nm) –Radius 10 (preclass 20: L seg =5 50ps) Few hundred PEs –Chip side 600-700 PE 400-500 thousand PEs –100s of cycles to cross
18
Penn ESE680-002 Fall2007 -- DeHon 18 Pipelined Interconnect In what cases is this convenient?
19
Penn ESE680-002 Fall2007 -- DeHon 19 Pipelined Interconnect When might it pose a challenge?
20
Penn ESE680-002 Fall2007 -- DeHon 20 Long Interconnect Path What happens here? C C B A
21
Penn ESE680-002 Fall2007 -- DeHon 21 Long Interconnect Path Adds pipeline delays May force further pipelining of logic to balance out paths More registers C C B A
22
Penn ESE534 Spring2010 -- DeHon 22 Flop Experiment #1 Pipeline/C-slow/retime to single LUT delay per cycle –MCNC benchmarks to 256 4-LUTs –no interconnect accounting –average 1.7 registers/LUT (some circuits 2--7) Reminder
23
Penn ESE534 Spring2010 -- DeHon 23 Flop Experiment #2 Pipeline and retime to HSRA cycle –place on HSRA –single LUT or interconnect timing domain –same MCNC benchmarks –average 4.7 registers/LUT [Tsu et al., FPGA 1999]
24
Penn ESE534 Spring2010 -- DeHon 24 Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism –Will have a distribution of retiming requirements –May differ from task to task –May vary independently from compute/interconnect requirements Another balance issue to watch –Balance with compute, interconnect Need a canonical way to measure –Like Rent?
25
Penn ESE534 Spring2010 -- DeHon 25 Retiming Supply
26
Penn ESE534 Spring2010 -- DeHon 26 Optional Output Flip-flop (optionally) on output –flip-flop: 4-5K 2 –Switch to select: ~ 5K 2 –Area: 1 LUT (800K 1M 2 /LUT) –Bandwidth: 1b/cycle
27
Penn ESE534 Spring2010 -- DeHon 27 Output Single Output –Ok, if don’t need other timings of signal Multiple Output –more routing
28
Penn ESE534 Spring2010 -- DeHon 28 Input More registers (K ) –7-10K 2 /register+mux –4-LUT => 30-40K 2 /depth No more interconnect than unretimed –open: compare savings to additional reg. cost Area: 1 LUT (1M+d*40K 2 ) get Kd regs d=4, 1.2M 2 Bandwidth: K/cycle 1/d th capacity
29
Preclass 3 Penn ESE534 Spring2010 -- DeHon 29
30
Penn ESE534 Spring2010 -- DeHon 30 Some Numbers (memory) Unit of area = 2 (F=2 –[more next week] Register as stand-alone element 4K 2 –e.g. as needed/used last lecture Static RAM cell 1K 2 –SRAM Memory (single ported) Dynamic RAM cell (DRAM process) 100 2 Dynamic RAM cell (SRAM process) 300 2 Day 5
31
Penn ESE534 Spring2010 -- DeHon 31 Retiming Density LUT+interconnect 1M 2 Register as stand-alone element 4K 2 Static RAM cell 1K 2 –SRAM Memory (single ported) Dynamic RAM cell (DRAM process) 100 2 Dynamic RAM cell (SRAM process) 300 2 Can have much more retiming memory per chip if put it in large arrays –…but then cannot get to it as frequently
32
Penn ESE534 Spring2010 -- DeHon 32 Retiming Structure Concerns Area: 2 /bit Throughput: bandwidth (bits/time) Energy
33
Penn ESE534 Spring2010 -- DeHon 33 Just Logic Blocks Most primitive –build flip-flop out of logic blocks I D*/Clk + I*Clk Q Q*/Clk + I*Clk –Area: 2 LUTs (800K 1M 2 /LUT each) –Bandwidth: 1b/cycle
34
Penn ESE534 Spring2010 -- DeHon 34 Separate Flip-Flops Network flip flop w/ own interconnect can deploy where needed requires more interconnect Vary LUT/FF ratio Arch. Parameter Assume routing inputs 1/4 size of LUT Area: 200K 2 each Bandwidth: 1b/cycle
35
Penn ESE534 Spring2010 -- DeHon 35 Virtex SRL16 Xilinx Virtex 4-LUT –Use as 16b shiftreg Area: ~1M 2 /16 60K 2 /bit –Does not need CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) –1/16 th capacity
36
Penn ESE534 Spring2010 -- DeHon 36 Register File Memory Bank From MIPS-X –1K 2 /bit + 500 2 /port –Area(RF) = (d+6)(W+6)(1K 2 +ports* 500 2 ) w>>6,d>>6 I+o=2 => 2K 2 /bit w=1,d>>6 I=o=4 => 35K 2 /bit –comparable to input chain More efficient for wide-word cases
37
Penn ESE534 Spring2010 -- DeHon 37 Xilinx CLB Xilinx 4K CLB –as memory –works like RF Area: 1/2 CLB (640K 2 )/16 40K 2 /bit –but need 4 CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) –1/16 th capacity
38
Penn ESE534 Spring2010 -- DeHon 38 Memory Blocks SRAM bit 1200 2 (large arrays) DRAM bit 100 2 (large arrays) Bandwidth: W bits / 2 cycles –usually single read/write –1/2 A th capacity
39
Penn ESE534 Spring2010 -- DeHon 39 Dual-Ported Block RAMs Virtex-6 Series 36Kb memories Stratix-4 –640b, 9Kb, 144Kb Can put 1M/1K 1K bits in space of 4-LUT –Trade few 4-LUTs for considerable memory
40
Penn ESE534 Spring2010 -- DeHon 40 Hierarchy/Structure Summary “Memory Hierarchy” arises from area/bandwidth tradeoffs –Smaller/cheaper to store words/blocks (saves routing and control) –Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect) –High bandwidth out of shallow memories –Applications have mix of retiming needs
41
Penn ESE534 Spring2010 -- DeHon 41 Modern FPGAs Output Flop (depth 1) Use LUT as Shift Register (16) Embedded RAMs (9Kb,36Kb) Interface off-chip DRAM (~0.1—1Gb) No retiming in interconnect –….yet
42
Penn ESE534 Spring2010 -- DeHon 42 Modern Processors DSPs have accumulator (depth 1) Inter-stage pipelines (depth 1) –Lots of pipelining in memory path… Reorder Buffer (4—32) Architected RF (16, 32, 128) Actual RF (256, 512…) L1 Cache (~64Kb) L2 Cache (~1Mb) L3 Cache (10-100Mb) Main Memory in DRAM (~10-100Gb)
43
Admin Final Exercise –Basic statement out –Will continue to refine and detail –One arch. Desc. to add by next Monday No new homework grading Reading for Wednesday on web Penn ESE534 Spring2010 -- DeHon 43
44
Penn ESE534 Spring2010 -- DeHon 44 Big Ideas [MSB Ideas] Tasks have a wide variety of retiming distances (depths) –Within design, among tasks Retiming requirements vary independently of compute, interconnect requirements (balance) Wide variety of retiming costs –100 2 1M 2 Routing and I/O bandwidth –big factors in costs Gives rise to memory (retiming) hierarchy
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.