Published by Randolf Thompson. Modified over 9 years ago.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications
From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University of Florida
Presentation by John Potts, University of Guelph
2 Outline
● Introduction
  – What is a sliding-window application?
  – Justification
● Background
  – Applications
● Methodologies
● Results
● Analysis
● Conclusions
3 Introduction: Sliding-Window Applications
● What is a sliding-window application?
● 2-dimensional signal analysis
● x by y image, n by m kernel (or window)
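The idea on this slide can be illustrated with a minimal Python sketch that visits every n by m window lying fully inside an x by y image. The function name `sliding_windows` and the list-of-rows image layout are assumptions made for illustration; they do not come from the presentation.

```python
def sliding_windows(image, n, m):
    """Yield (row, col, window) for every n-by-m window fully inside the image.

    `image` is a list of x rows, each holding y pixel values (assumed layout).
    """
    x, y = len(image), len(image[0])
    for r in range(x - n + 1):
        for c in range(y - m + 1):
            # Copy the n rows and m columns that start at (r, c).
            window = [row[c:c + m] for row in image[r:r + n]]
            yield r, c, window


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
windows = list(sliding_windows(image, 2, 2))
print(len(windows))  # (3-2+1) * (4-2+1) window positions
```

Each application in the study applies a different function to every such window.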
4 Introduction: Sliding-Window Applications
5 Introduction: Justification
● Computing architectures are tending towards parallelism and heterogeneity
● GPUs are common
● A multitude of accelerator options is available
● Metrics for different devices vary widely between applications
● There are often many Pareto-optimal solutions
● The study focuses on one application type (sliding-window) and two design criteria (performance and energy)
6 Introduction: Devices
Devices tested:
● Altera Stratix III E260 FPGA
● NVIDIA GeForce GTX 295 GPU, using the CUDA framework
● Quad-core Xeon W3520, using the OpenCL multicore framework
● Single-chip systems were also examined
7 Background: Previous Work
● Application performance for FPGAs and GPUs
  – Sinha et al.: feature tracker on GPU
  – Porter et al.: stereo matching algorithms on FPGA
● FPGA, GPU and CPU comparisons
  – Baker et al.: matched filter algorithm; Cell processor best for performance and energy, GPU best for performance per dollar
  – Pauwels et al.: vision-based algorithms; FPGAs best for single-stage algorithms only
● Different use cases:
  – Cope et al.: 2D convolution and colour correction; performance dependent on kernel size
  – Asano et al.: CPU, GPU and FPGA for 2D filter, SAD stereo vision disparity, and k-means clustering applications
8 Background: Improvements offered by this study
● Provides a more in-depth analysis of sliding-window applications
● Covers a wider range of image and kernel sizes
● Presents a generalized circuit architecture
● Optimizations deliver real-time sliding-window processing of HD video on a single GPU or FPGA
● Evaluates a new application based on Information Theoretic Learning
9 Background: Applications
● Applications where the kernel is fully immersed in the image
● SAD: Sum of Absolute Differences
● 2D Convolution
● Correntropy
● 2D FFT (GPU and multicore only)
10 Applications: Sum of Absolute Differences
● Detects the degree of similarity between images
● E.g. a security system
● Operation
● Output: structure of size (x-n+1) x (y-m+1)
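As a rough illustration of the operation and the output size stated above, the following sketch computes a SAD value for every fully immersed window. The name `sad_map` and the list-of-rows layout are assumptions for illustration; this is a plain sequential version, not the optimized implementations evaluated in the study.

```python
def sad_map(image, kernel):
    """Return the (x-n+1) by (y-m+1) grid of sum-of-absolute-differences values."""
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for r in range(x - n + 1):
        out_row = []
        for c in range(y - m + 1):
            total = 0
            for i in range(n):
                for j in range(m):
                    total += abs(image[r + i][c + j] - kernel[i][j])
            out_row.append(total)
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[5, 6],
          [8, 9]]
result = sad_map(image, kernel)
print(result)  # [[16, 12], [4, 0]]: 0 marks the exact match at the bottom-right
```

A lower SAD value means a closer match, so the best match is the minimum of the output.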
11 Applications: 2D Convolution
● Used in digital signal processing, scientific computing, and embedded systems from small to high-performance
● Operation
● Equation:
● Common optimization
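The slide's equation did not survive extraction. As a hedged stand-in, the sketch below uses the textbook definition of discrete 2D convolution (the kernel is flipped in both dimensions before the multiply-accumulate), restricted to fully immersed windows. The name `convolve2d_valid` is an assumption, and the code does not attempt the "common optimization" the slide refers to.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution over windows fully inside the image."""
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for r in range(x - n + 1):
        out_row = []
        for c in range(y - m + 1):
            acc = 0
            for i in range(n):
                for j in range(m):
                    # Flipping the kernel indices distinguishes convolution
                    # from plain correlation.
                    acc += image[r + i][c + j] * kernel[n - 1 - i][m - 1 - j]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = convolve2d_valid(image, kernel)
print(result)  # [[4, 4], [4, 4]]
```

Without the flip the same inputs would give -4 at every position, which is a quick way to check which convention an implementation uses.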
12 Applications: Correntropy
● Measure of similarity based on Information Theoretic Learning
● Many possible applications; the study focuses on one similar to SAD
● Equation:
● Operation
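The equation on this slide is missing from the transcript. The sketch below assumes the commonly used sample estimator of correntropy, the mean of a Gaussian kernel applied to the element-wise differences, evaluated per window in the same way as SAD. The names `gaussian` and `correntropy_map` and the bandwidth parameter `sigma` are illustrative assumptions, not taken from the presentation.

```python
import math


def gaussian(e, sigma):
    """Gaussian kernel of bandwidth sigma evaluated at difference e."""
    return math.exp(-(e * e) / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def correntropy_map(image, kernel, sigma=1.0):
    """Assumed estimator: mean Gaussian of pixel differences for each window."""
    x, y = len(image), len(image[0])
    n, m = len(kernel), len(kernel[0])
    out = []
    for r in range(x - n + 1):
        out_row = []
        for c in range(y - m + 1):
            acc = 0.0
            for i in range(n):
                for j in range(m):
                    acc += gaussian(image[r + i][c + j] - kernel[i][j], sigma)
            out_row.append(acc / (n * m))
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[5, 6],
          [8, 9]]
result = correntropy_map(image, kernel)
# Unlike SAD, a higher value means a closer match.
best = max((value, r, c) for r, row in enumerate(result) for c, value in enumerate(row))
print(best[1], best[2])  # position of the most similar window
```

Under this assumption the structure matches SAD with one extra per-pixel step (the Gaussian), and the best match is a maximum rather than a minimum.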
13 Methodology: FPGA Circuit Architecture:
14 Methodology: FPGA
● Uses a window generator to reduce bandwidth requirements
● Controller and host software transfer the image, initialize the circuit, poll for completion, and read the output
● SAD implementation
● 2D Convolution
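One way to picture how a window generator reduces bandwidth is a software model with line buffers: each pixel is read from the input stream exactly once, and windows are assembled from the n most recently buffered rows. This is a simplified sketch under that assumption, not the circuit from the study; the name `window_generator` is hypothetical, and real hardware would typically emit a window per clock cycle rather than per completed row.

```python
from collections import deque


def window_generator(pixels, width, n, m):
    """Consume pixels in raster order once; yield (row, col, window) tuples."""
    lines = deque(maxlen=n)  # line buffers: the n most recent complete rows
    current = []
    row_index = 0
    for p in pixels:
        current.append(p)
        if len(current) == width:
            lines.append(current)
            current = []
            if len(lines) == n:
                top = row_index - n + 1  # image row of the window's first line
                for c in range(width - m + 1):
                    yield top, c, [line[c:c + m] for line in lines]
            row_index += 1


stream = range(1, 13)  # a 3-row by 4-column image delivered pixel by pixel
windows = list(window_generator(stream, width=4, n=2, m=2))
print(len(windows))  # 6 windows from 12 pixel reads
```

A naive implementation would instead re-read up to n times m pixels per window, which is the bandwidth cost the buffering avoids.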
15 Methodology: FPGA ● Correntropy
16 Methodology: FPGA Resources
Implementation                   LUTs      Registers   Block Memory Bits   DSP Blocks
SAD                              137,260   156,377     2,256,464           0
2D Convolution: Fixed Point      33,547    57,122      1,601,104           738
2D Convolution: Floating Point   129,024   126,821     1,633,872           676
Correntropy                      141,633   143,137     2,256,464           0
17 Methodology: GPU
● Uses a specialized memory organisation
● a x b output pixels; 64x32 selected
● Macroblock size balances threads per block against memory bank conflicts; 2x2 chosen
● SAD: calculated between the kernel and the four windows in the corresponding macroblock
18 Methodology: GPU
● 2D Convolution: similar to SAD
  – Frequency-domain version also implemented (2D FFT)
● Correntropy: SAD with an extra step
  – Challenge: locating maximum similarity values
19 Methodology: Multicore
● Uses the OpenCL parallel programming standard
● Optimizations focused on minimizing communication between threads
● Implementation consists of a straightforward specification of the window function
20 Results
● Results examined include FPS, speedup analysis, and energy efficiency
● Single-chip systems such as APUs and standalone FPGAs examined
  – Upper-bound estimates found by removing PCIe transfer times
● Sequential C++ implementations used as the baseline
● Implementations evaluated for 480p, 720p, and 1080p video
● Kernel sizes of 4x4, 9x9, 16x16, and 25x25 evaluated for all applications; 36x36 and 45x45 also evaluated for SAD and correntropy
21 Results: Sum of Absolute Difference
22 Results: 2D Convolution
23 Results: Correntropy
24 Results: Speedup
25 Results: Application Comparison
26 Results: Analysis
● GPU is best for smaller kernels (4x4 and 9x9) and equivalent at 16x16
● FPGA speedup reached 240x, 45x, and 298x over the sequential baseline for SAD, 2D convolution, and correntropy respectively
● 2D Convolution: GPU-FFT was faster than the FPGA
● FPGA implementations were near constant time due to pipelining; extra steps show up as added latency rather than reduced throughput
27 Results: Single-Chip Systems
● PCIe transfer times were as much as 65% of GPU execution time and 64% of FPGA execution time
● The single-chip FPGA estimate is consistently about 2x faster than the PCIe-attached FPGA
● At the time of writing, GPU time minus PCIe transfer time was not a realistic estimate, because standalone or single-chip GPU systems did not have nearly the capability of the device tested
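To make the upper-bound reasoning concrete: if PCIe transfers account for a fraction f of total execution time, removing them leaves (1 - f) of the time, so the best-case speedup is 1 / (1 - f). The short sketch below applies this to the peak fractions quoted on the slide; the function name `upper_bound_speedup` is an assumption for illustration.

```python
def upper_bound_speedup(transfer_fraction):
    """Best-case speedup when a given fraction of run time is removed."""
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    return 1.0 / (1.0 - transfer_fraction)


gpu_bound = upper_bound_speedup(0.65)   # peak GPU transfer share from the slide
fpga_bound = upper_bound_speedup(0.64)  # peak FPGA transfer share from the slide
print(round(gpu_bound, 2), round(fpga_bound, 2))  # 2.86 2.78
```

These figures use the "as much as" peak fractions, so they are ceilings; a typical gain of roughly 2x corresponds to transfers taking about half of the execution time.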
28 Results: Energy Comparison
Energy consumed for one frame
29 Results: Energy Comparison
Theoretical wattage for 30 fps:
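The figures for this slide are not in the transcript. Assuming the theoretical wattage is derived from the per-frame energy on the previous slide, the relationship is simply power = energy per frame x frames per second. The sketch below shows that arithmetic with a purely hypothetical energy value; neither the function name nor the 0.5 J figure comes from the study.

```python
def average_power_watts(energy_per_frame_joules, fps=30.0):
    """Average power needed to sustain a frame rate at a given energy per frame."""
    return energy_per_frame_joules * fps


# Hypothetical value for illustration only: 0.5 J per frame.
example_watts = average_power_watts(0.5)
print(example_watts)  # 15.0
```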
30 Results: Energy Comparison
● Example application: embedded system using correntropy for target tracking
31 Conclusions
● Examined the performance and energy requirements of sliding-window applications across a variety of devices and use cases
● FPGAs were faster except for small inputs
● FPGAs had lower power requirements
● Consistency of results suggests applicability to other sliding-window applications
32 Questions?