A Computing Origami: Folding Streams in FPGAs S. M. Farhad PhD Student University of Sydney DAC 2009, California, USA
Outline
Motivation
Stream programming
FPGA
Problem
Stream Folding
Results
Conclusion
Stream Programming Paradigm
Programs are expressed as stream graphs
Streams: sequences of data elements
Actors (filters): functions applied to streams
Independent actors with explicit communication
Regular and repeating computation
[Figure: stream graph with labels "Actor/Filter" and "Streams"]
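The paradigm above can be illustrated with a minimal sketch, assuming Python generators stand in for streams and plain functions stand in for actors. The names `source`, `actor`, and `pipeline` are hypothetical and chosen only for this illustration; they do not come from any stream language or from the original talk.

```python
from typing import Callable, Iterable, Iterator


def source(values: Iterable[int]) -> Iterator[int]:
    """A stream: a sequence of data elements produced one at a time."""
    yield from values


def actor(fn: Callable[[int], int], stream: Iterator[int]) -> Iterator[int]:
    """An actor (filter): applies fn to every element of its input stream."""
    for item in stream:
        yield fn(item)


def pipeline(stream: Iterator[int], *fns: Callable[[int], int]) -> Iterator[int]:
    """Chain actors; each one communicates only through its input and output stream."""
    for fn in fns:
        stream = actor(fn, stream)
    return stream


# Two independent actors applied to a five-element stream.
result = list(pipeline(source(range(5)), lambda x: x * 2, lambda x: x + 1))
```

Because each actor touches only its own input and output, the same computation repeats for every element, which is the regularity that replication on an FPGA can exploit.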
FPGA
FPGAs are widely available as programmable coprocessors
Opportunities to exploit FPGA-based acceleration
Multimedia, networking, graphics, and security codes
Problem
Maximize throughput subject to area and latency constraints
Resolve bottleneck actors
Replicated filters do not require resynthesis
Motivating Example
Area/Throughput Design Folding
    foreach Filter f in S do
        workFactor[f] = f.latency * S.runs(f);
        designPointArea += f.area * workFactor[f];
    scaleLimit = min over f with f.hasState of (1 / workFactor[f]);
    scaling = min(AREA / designPointArea, scaleLimit);
    foreach Filter f in S do
        replication[f] = workFactor[f] * scaling;
    while area(replication) > AREA do
        replication = reduceThroughput(replication);
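The folding pseudocode can be tried out with a minimal Python sketch. Several details are assumptions made only to get something runnable, since the slide does not define them: `area()` rounds each replication factor up to a whole instance, `reduce_throughput()` scales every factor by a fixed step of 0.9, and the `Filter` fields hold per-firing latency, per-instance area, and the steady-state firing count `S.runs(f)`. The approach in the talk may well differ on all three points.

```python
import math
from dataclasses import dataclass


@dataclass
class Filter:
    name: str
    latency: float           # latency of one firing (assumed unit: cycles)
    area: float              # area of one instance (assumed unit: LUTs)
    runs: int                # firings per steady-state iteration, S.runs(f)
    has_state: bool = False  # stateful filters limit the scaling factor


def area(filters, replication):
    """Assumed area model: each filter needs a whole number of instances."""
    return sum(f.area * math.ceil(replication[f.name]) for f in filters)


def reduce_throughput(replication, step=0.9):
    """Assumed reduction rule: scale every replication factor down uniformly."""
    return {name: r * step for name, r in replication.items()}


def fold(filters, area_budget):
    """Scale replication factors to the work of each filter within the area budget."""
    if area_budget < sum(f.area for f in filters):
        raise ValueError("budget cannot hold one instance of every filter")
    work = {f.name: f.latency * f.runs for f in filters}
    design_point_area = sum(f.area * work[f.name] for f in filters)
    # Scaling is capped by the stateful filters; no cap if there are none.
    scale_limit = min(
        (1.0 / work[f.name] for f in filters if f.has_state), default=math.inf
    )
    scaling = min(area_budget / design_point_area, scale_limit)
    replication = {f.name: work[f.name] * scaling for f in filters}
    while area(filters, replication) > area_budget:
        replication = reduce_throughput(replication)
    return replication


stateless = [Filter("A", 2, 100, 1), Filter("B", 4, 50, 1)]
stateful = [Filter("A", 2, 100, 1), Filter("B", 4, 50, 2, has_state=True)]
folded_stateless = fold(stateless, 300)
folded_stateful = fold(stateful, 300)
```

In the stateful case the scale limit of 1/8 dominates, so filter B ends up with a replication factor of exactly 1; in the stateless case the initial factors overshoot the budget once rounded up, and the loop reduces them until the design fits.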
Calculating Throughput
Calculating Latency
FPGAs that are coupled to host processors
Initiation interval (DMA)
Replication improves throughput, but it often increases latency
Major factors for latency variation:
    Non-periodic data arrival
    Data-token reordering
    Local congestion
Latency-constrained design folding
    latConf = null; T = ∞;
    while throughput(thrConf) ≤ T do
        if feasibleImprovement(thrConf) then
            candidates = simAnnealing(thrConf, T);
            foreach candidate in candidates do
                if throughput(candidate) < T then
                    latConf = candidate;
                    T = throughput(latConf);
        thrConf = reduceThroughput(thrConf);
    return latConf
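Results
Table columns: Benchmark | Minimum area (LUTs, Latency, II) | Best throughput (LUTs, Latency, II) | Constrained design (LUTs, Latency, II) | Constraint | Run time (s)
MatrixMult | Constraint: Latency ≤ [value missing]
Serpent | Constraint: Latency ≤ [value missing]
FFT | Constraint: AREA ≤ [value missing]
FMRadio | Constraint: AREA ≤ [value missing]
DCT | Constraint: AREA ≤ [value missing]
BitonicSort | Constraint: AREA ≤ [value missing]
Synthetic | Constraint: AREA ≤ [value missing]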
Questions?