Published by Brendan Shields. Modified over 9 years ago.
1
A Computing Origami: Folding Streams in FPGAs
S. M. Farhad, PhD Student, University of Sydney
DAC 2009, California, USA
2
Outline
Motivation
Stream programming
FPGA
Problem
Stream Folding
Results
Conclusion
3
Stream Programming Paradigm
Programs are expressed as stream graphs
Streams: sequences of data elements
Actors (filters): functions applied to streams
Independent actors with explicit communication
Regular and repeating computation
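To make the actor and stream vocabulary concrete, here is a minimal sketch in Python. It is not taken from the presentation: the names `actor` and `pipeline` are hypothetical, and it assumes the simplest case, a linear graph in which each actor consumes one element and produces one element.

```python
from typing import Callable, Iterable, Iterator


def actor(fn: Callable[[int], int]) -> Callable[[Iterable[int]], Iterator[int]]:
    """Wrap a per-element function as an actor that maps a stream to a stream."""
    def run(stream: Iterable[int]) -> Iterator[int]:
        for item in stream:
            yield fn(item)
    return run


def pipeline(source: Iterable[int], *actors) -> Iterator[int]:
    """Connect actors in sequence; the only communication is the stream itself."""
    stream = iter(source)
    for a in actors:
        stream = a(stream)
    return stream


# Two independent actors: neither shares state with the other.
scale = actor(lambda x: 2 * x)
offset = actor(lambda x: x + 1)

result = list(pipeline(range(4), scale, offset))
```

Because the actors interact only through their input and output streams, each one could in principle be mapped to its own hardware block.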
4
FPGA
FPGAs are widely available as programmable coprocessors
Opportunities to exploit FPGA-based acceleration: multimedia, networking, graphics, and security codes
5
Problem
Maximizing throughput subject to area and latency constraints
Resolving bottleneck actors
The replicated filters do not require resynthesis
6
Motivating Example
9
Outline
Motivation
Stream programming
FPGA
Problem
Stream Folding
Results
Conclusion
10
Area/Throughput Design Folding
    foreach Filter f in S do
        workFactor[f] = f.latency · S.runs(f);
        designPointArea += f.area · workFactor[f];
    scaleLimit = min over f with f.hasState of (1 / workFactor[f]);
    scaling = min(AREA / designPointArea, scaleLimit);
    foreach Filter f in S do
        replication[f] = workFactor[f] · scaling;
    while area(replication) > AREA do
        replication = reduceThroughput(replication);
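The folding procedure on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The slide does not define `area(...)` or `reduceThroughput(...)`, so `total_area` (which rounds each replication factor up to a whole instance) and `reduce_throughput` (a uniform 10% shrink) are hypothetical stand-ins, as are the `Filter` fields and the feasibility guard. The scale limit is read as "a stateful filter cannot be replicated beyond one instance".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    name: str
    latency: float        # cycles per firing
    area: float           # LUTs for one instance
    runs: int             # firings per steady-state iteration, S.runs(f)
    has_state: bool = False


def total_area(filters, replication):
    # Assumption: a fractional replication still occupies one whole instance.
    return sum(f.area * math.ceil(replication[f.name]) for f in filters)


def reduce_throughput(replication, factor=0.9):
    # Hypothetical stand-in: shrink every replication factor uniformly.
    return {name: r * factor for name, r in replication.items()}


def fold(filters, area_budget):
    if area_budget < sum(f.area for f in filters):
        raise ValueError("budget is below one instance of every filter")
    work = {f.name: f.latency * f.runs for f in filters}
    design_point_area = sum(f.area * work[f.name] for f in filters)
    # Stateful filters cap the scaling so their replication stays at or below 1.
    scale_limit = min((1 / work[f.name] for f in filters if f.has_state),
                      default=math.inf)
    scaling = min(area_budget / design_point_area, scale_limit)
    replication = {f.name: work[f.name] * scaling for f in filters}
    while total_area(filters, replication) > area_budget:
        replication = reduce_throughput(replication)
    return replication


filters = [
    Filter("A", latency=2, area=100, runs=1),
    Filter("B", latency=4, area=50, runs=2),
    Filter("C", latency=1, area=200, runs=1, has_state=True),
]
rep = fold(filters, area_budget=1200)
```

In this toy graph the stateful filter `C` limits the scaling to 1, so the area budget of 1200 is not fully used. With a tighter budget, the `while` loop trades throughput for area until the design fits.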
11
Calculating Throughput
12
Calculating Latency
FPGAs that are coupled to host processors
Initiation interval (DMA)
Replication improves throughput, but it often increases latency
Major factors for latency variation:
Non-periodic data arrival
Data-token reordering
Local congestion
13
Latency-constrained design folding
    latConf = null; T = ∞;
    while throughput(thrConf) ≤ T do
        if feasibleImprovement(thrConf) then
            candidates = simAnnealing(thrConf, T);
            foreach candidate in candidates do
                if throughput(candidate) < T then
                    latConf = candidate;
                    T = throughput(latConf);
        thrConf = reduceThroughput(thrConf);
    return latConf
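The control flow of this loop can be sketched in Python with the four helper routines passed in as callables, since the slide does not define them. Two assumptions are made here and are not confirmed by the source: `throughput(...)` is treated as a cost such as the initiation interval, so smaller is better, and `reduce_throughput` returns `None` when no further reduction exists, which guarantees termination. The toy helpers in the usage part are hypothetical stand-ins; in particular, the toy `sim_annealing` does not perform simulated annealing.

```python
import math


def latency_constrained_fold(thr_conf, throughput, feasible_improvement,
                             sim_annealing, reduce_throughput):
    """Literal transliteration of the slide's loop, plus a None guard."""
    lat_conf = None
    best = math.inf                      # T on the slide
    while thr_conf is not None and throughput(thr_conf) <= best:
        if feasible_improvement(thr_conf):
            for candidate in sim_annealing(thr_conf, best):
                if throughput(candidate) < best:
                    lat_conf = candidate
                    best = throughput(candidate)
        thr_conf = reduce_throughput(thr_conf)
    return lat_conf


# Toy model (assumption): a configuration is just its initiation interval,
# and latency falls as the initiation interval grows.
LATENCY_BOUND = 40


def toy_latency(ii):
    return 100 / ii


result = latency_constrained_fold(
    thr_conf=1,
    throughput=lambda ii: ii,            # cost: smaller is better
    feasible_improvement=lambda ii: True,
    # Stand-in search: accept the configuration itself if it meets the bound.
    sim_annealing=lambda ii, t: [ii] if toy_latency(ii) <= LATENCY_BOUND else [],
    reduce_throughput=lambda ii: ii + 1 if ii < 5 else None,
)
```

In the toy run, configurations 1 and 2 violate the latency bound, configuration 3 is accepted, and the loop stops at 4 because it can no longer beat the recorded cost.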
14
Results
Columns: Benchmark | Minimum area (LUTs, Latency, II) | Best throughput (LUTs, Latency, II) | Constrained design (LUTs, Latency, II) | Constraint | Run time
MatrixMult | 1498480197618185345581757 | Latency ≤ 175 | 1.14s
Serpent | 3028102743878773230539014 | Latency ≤ 910 | 0.73s
FFT | 23761011993433707642395308687 | AREA ≤ 40000 | 34.7s
FMRadio | 374583713987564371136251137120 | AREA ≤ 65000 | 1.01s
DCT | 4575234931372563491915043492 | AREA ≤ 120000 | 0.73s
BitonicSort | 4392010423131760104214740012822 | AREA ≤ 50000 | 18.3s
Synthetic | 350309135159905042149030947 | AREA ≤ 1500 | 0.43s
15
Questions?