Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid, Dept. of Computer Science and Engineering


1/21 Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
{chuang,vahid}@cs.ucr.edu
This work was supported in part by NSF CNS-1016792.

2/21 Outline
- Haar-feature based object detection algorithm
- Custom design space exploration: feature mapping problem
- Experimental results

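3/21 Haar-feature based object detection algorithm
- A 20x20 sub-window moves across the image along the X axis (0 to 320) and the Y axis (0 to 240).
- Sub-windows in the original image: (320 - 20) * (240 - 20) = 66,000.
- The search is repeated on scaled images, so faces are detected at different scales.
- Figure: original image and scaled images, showing the movement of the sub-window and a sub-window where a face is found.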
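
The sub-window count on this slide can be reproduced with a few lines of Python. This is a minimal sketch: `count_subwindows` follows the slide's formula exactly, while `pyramid_subwindows` and its scale factor of 1.25 are assumptions for illustration, since the slide does not say how the scaled images are sized.

```python
def count_subwindows(width, height, window=20):
    """Sub-window positions per image, using the slide's (W - window) * (H - window)."""
    return (width - window) * (height - window)


def pyramid_subwindows(width, height, window=20, scale=1.25):
    """Total sub-windows over all scaled images.

    The scale factor is a hypothetical value, not taken from the slides.
    """
    total = 0
    while width >= window and height >= window:
        total += count_subwindows(width, height, window)
        width, height = int(width / scale), int(height / scale)
    return total


print(count_subwindows(320, 240))  # 66000
```

Whatever scale factor is used, the total over all scaled images is larger than the 66,000 sub-windows of the original image alone, which is why per-window cost dominates the design.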

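4/21 Face detection in sub-window
- Each 20 x 20 sub-window is tested against facial Haar features and either passes or fails.
- Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B).
- Pixel_Sum is calculated in constant time from the integral image. Each integral image entry stores the pixel sum of the rectangle from the top-left corner to that point, so a rectangle sum needs only 4 corner values: Pixel_Sum(R1) = P4 - P2 - P3 + P1.
- Example (3 x 3 image of ones, with R1 taken as the lower-right 2 x 2 block):
    Original image    Integral image
    1 1 1             1 2 3
    1 1 1             2 4 6
    1 1 1             3 6 9
  Pixel_Sum(R1) = P4 - P2 - P3 + P1 = 9 - 3 - 3 + 1 = 4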
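
The integral-image lookup above can be sketched in Python as follows. The function names and the (top, left, bottom, right) rectangle convention are choices made for this illustration; the 3 x 3 image of ones is the example implied by the slide's integral image.

```python
def integral_image(img):
    """Each entry holds the sum of all pixels from the top-left corner to that point."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def pixel_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom and columns left..right (inclusive), from 4 corners."""
    p4 = ii[bottom][right]
    p2 = ii[top - 1][right] if top > 0 else 0
    p3 = ii[bottom][left - 1] if left > 0 else 0
    p1 = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return p4 - p2 - p3 + p1


def haar_value(ii, white, black):
    """Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)."""
    return pixel_sum(ii, *white) - pixel_sum(ii, *black)


ii = integral_image([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
print(ii)                         # [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(pixel_sum(ii, 1, 1, 2, 2))  # 4
```

The cost of `pixel_sum` does not depend on the rectangle size, which is what makes the per-feature work constant in hardware as well.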

5/21 Cascade decision process
- The frontal-face feature set has 2000 features, divided into multiple stages: S1 (2 features), S2 (5 features), S3 (16 features), ..., S22 (212 features).
- A sub-window that passes every stage is reported as a detected face.
- Failing any stage rejects the current sub-window.
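
A minimal Python sketch of the cascade's early-exit behavior is shown below. The data layout (a feature as a tuple of value function, threshold, left value, right value) and the direction of both comparisons are assumptions made for illustration; the slides only show that each feature contributes a left or right value and that failing any stage rejects the sub-window.

```python
def stage_passes(features, stage_threshold, window):
    """Sum each feature's left/right contribution and compare with the stage threshold."""
    total = 0.0
    for value_fn, threshold, left, right in features:
        # Assumed convention: below the feature threshold selects the left value.
        total += left if value_fn(window) < threshold else right
    return total >= stage_threshold


def cascade_detect(stages, window):
    """Return (detected, number of features examined); stops at the first failed stage."""
    examined = 0
    for features, stage_threshold in stages:
        examined += len(features)
        if not stage_passes(features, stage_threshold, window):
            return False, examined
    return True, examined


# Toy two-stage cascade over a scalar "window" (hypothetical data).
toy_stages = [
    ([(lambda w: w, 5, 0.0, 1.0)], 0.5),
    ([(lambda w: w, 10, 0.0, 1.0)], 0.5),
]
print(cascade_detect(toy_stages, 3))   # (False, 1)
print(cascade_detect(toy_stages, 12))  # (True, 2)
```

Because most sub-windows are rejected in the first few stages, the number of examined features per window varies widely, which matters for the classifier timing discussed later.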

6/21 Algorithm FPGA implementation
Block diagram of the FPGA design. Components: frame grabber (video in), image scaler, buffer controller, integral image (20 x 20 sub-window), classifier (Haar feature calculation/decision), and rectangle drawer (video out, objects in rectangles).

7/21 Integral image and Classifier
- Integral image buffer: 20 x 20 register file with 17-bit entries.
- Data delivery: muxes select the rectangle corner values (a1 to a4, b1 to b4, c1 to c4) from the integral image buffer.
- Classifier: forms the rectangle sums, multiplies them by constants (x2, x3), and adds them to get the feature value. The feature value is compared with the feature threshold to select the left value or the right value, which is added to the feature sum.

8/21 Communication bottleneck
General communication architecture: each classifier port reads the 20 x 20 integral image through a 400-to-1 mux.
- One 400-to-1 17-bit mux: 2,300 LUTs
- 12 muxes: 27,600 LUTs, 40% of a Virtex5 110T (69,120 LUTs)
Drawbacks:
- Does not scale well for multiple classifiers
- Wire congestion problem
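
The LUT budget on this slide is simple arithmetic, sketched below in Python. The per-mux cost, port count, and device size are the slide's figures; the linear scaling to several classifiers is an assumption used only to illustrate why the general architecture does not scale.

```python
LUTS_PER_400_TO_1_MUX = 2300   # 17-bit wide, from the slide
PORTS_PER_CLASSIFIER = 12      # one mux per classifier port
VIRTEX5_110T_LUTS = 69120


def general_comm_luts(num_classifiers=1):
    """Mux LUTs if every port of every classifier gets its own 400-to-1 mux (assumed)."""
    return LUTS_PER_400_TO_1_MUX * PORTS_PER_CLASSIFIER * num_classifiers


for n in (1, 2, 4):
    luts = general_comm_luts(n)
    print(n, luts, f"{luts / VIRTEX5_110T_LUTS:.0%}")
```

Under that assumption a single classifier already uses about 40% of the device for muxes alone, and three or more classifiers would exceed the whole device.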

9/21 Custom communication architecture for multi-classifier
- Multiple classifiers (CF1, CF2, CF3, CF4) read from the same integral image, shown here with a 400-1 mux.
- Figure: grid of feature number (1 to 16) against classifier number (CF1 to CF4).

10/21 Custom communication architecture for multi-classifier
- Custom communication architecture: each classifier port gets its own smaller mux, for example CF1_port1 (24-1 mux), CF2_port9 (9-1 mux), CF3_port7 (24-1 mux), CF4_port2 (16-1 mux).
- Figure: grid of feature number (1 to 16) against classifier number (CF1 to CF4).

11/21 Feature mapping problem
- Example: mapping 26 features into 4 classifiers (CF1, CF2, CF3, CF4).
- The features belong to stages (Stage 1, Stage 2, Stage 3 in the example). A sub-window must pass Stage 1, Stage 2, ..., Stage n for an object to be found; failing a stage rejects it.
- Figure: features 1 to 26, grouped by stage, assigned to classifiers CF1 to CF4.

12/21 Feature mapping problem
- The number of possible mappings grows exponentially with the number of features.
- Approach: simulated annealing, with two neighbor moves: swap and migrate.
- Total stage delay reflects performance; total wire number reflects size.
- Objective: Min (Total stage delay * Total wire number).
- 1 million iterations (30 min).
- Figure: mapping 26 features (Stage 1 to Stage 3) into 4 classifiers (CF1 to CF4).
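
The search described on this slide can be sketched in Python as below. Only the objective (total stage delay times total wire number) and the two neighbor moves (swap, migrate) come from the slide. Everything else is an assumption for illustration: the delay model (a stage takes as long as its most loaded classifier), the wire model (distinct integral image entries per classifier), the linear cooling schedule, and the randomly generated feature data.

```python
import math
import random


def cost(mapping, stage_of, entries_of, num_cf):
    """Objective from the slide: total stage delay * total wire number."""
    load = {}
    wires = [set() for _ in range(num_cf)]
    for feature, cf in mapping.items():
        key = (stage_of[feature], cf)
        load[key] = load.get(key, 0) + 1
        wires[cf] |= entries_of[feature]
    stage_delay = {}
    for (stage, _cf), count in load.items():
        # Assumed model: classifiers run in parallel, so the busiest one sets the delay.
        stage_delay[stage] = max(stage_delay.get(stage, 0), count)
    return sum(stage_delay.values()) * sum(len(w) for w in wires)


def neighbor(mapping, num_cf, rng):
    """Swap the classifiers of two features, or migrate one feature."""
    new = dict(mapping)
    features = list(new)
    if rng.random() < 0.5:
        a, b = rng.sample(features, 2)
        new[a], new[b] = new[b], new[a]
    else:
        new[rng.choice(features)] = rng.randrange(num_cf)
    return new


def anneal(stage_of, entries_of, num_cf, iterations=20000, t0=10.0, seed=0):
    rng = random.Random(seed)
    current = {f: rng.randrange(num_cf) for f in stage_of}
    cur_cost = cost(current, stage_of, entries_of, num_cf)
    best, best_cost = current, cur_cost
    for i in range(iterations):
        temp = t0 * (1 - i / iterations) + 1e-9  # linear cooling (assumption)
        cand = neighbor(current, num_cf, rng)
        c = cost(cand, stage_of, entries_of, num_cf)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost


# Hypothetical data: 26 features split over 3 stages, 8 integral image entries each.
stage_of = {}
feature_id = 1
for stage, size in enumerate([5, 7, 14], start=1):
    for _ in range(size):
        stage_of[feature_id] = stage
        feature_id += 1
data_rng = random.Random(1)
entries_of = {f: frozenset(data_rng.sample(range(400), 8)) for f in stage_of}

if __name__ == "__main__":
    mapping, value = anneal(stage_of, entries_of, num_cf=4)
    print(value, mapping)
```

The product objective pulls in two directions: spreading a stage's features over all classifiers lowers delay, while grouping features that share integral image entries lowers the wire count.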

13/21 Automatic VHDL code generation
- Feature mapping for Classifier 1: features 1, 4, 66, 3 (needs integral image entries 5, 24, 46, 92).
- Scheduling: a BRAM holds the mux select values.
- Structural RTL code for communication components:
    Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
    C1: classifier port map(dout, …);
    Bram1: bram generic map(2, 1, 4, 3, …) port map(…., select);
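
As a rough illustration of the generation step, the Python sketch below derives the mux size from a feature mapping and emits the mux instantiation line in the shape shown on the slide. The helper names are hypothetical, and the pairing of each feature with one entry (1 with 5, 4 with 24, 66 with 46, 3 with 92) is an assumption; the slide lists only the two sets. The classifier and BRAM lines are not generated because their arguments are elided on the slide.

```python
def mux_inputs(feature_entries, mapped_features):
    """Distinct integral image entries needed by one classifier port, sorted."""
    return sorted({feature_entries[f] for f in mapped_features})


def mux_instance(name, entries):
    """Emit a structural mux instantiation sized to the number of entries."""
    ports = ", ".join(f"II({e})" for e in entries)
    return f"{name}: mux{len(entries)} port map({ports}, select, dout);"


# Assumed per-feature pairing for the slide's Classifier 1 example.
feature_entries = {1: 5, 4: 24, 66: 46, 3: 92}
entries = mux_inputs(feature_entries, [1, 4, 66, 3])
print(mux_instance("Mux1", entries))
# Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
```

Features that share an entry collapse to one mux input, which is how a good feature mapping shrinks the mux well below 400 inputs.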

14/21 Review of custom design space exploration
- Flow: object detection application, then custom design space exploration (program analysis, design exploration, design generation), then mapping to different FPGAs.
- Inputs: resource constraints, performance requirements.
- Output: Pareto design points (execution time against size) for different numbers of classifiers.
- Issues addressed: communication bottleneck (400-1 mux) and the feature mapping problem.

15/21 Experiment scenarios
- Different implementations
    Desktop: Pentium 4 3.0 GHz, fixed-point C
    FPGA: 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex LX50T, LX110T, and LX155T (a classifier has 12 ports)
- Feature sets
    Face: 2135 features
    Eye: 1066 features
- Sample images: Face (simple), Face (complex), Eye

16/21 Experiment: FPGA resource utilization
- Designs are mapped to different Xilinx Virtex5 FPGAs: LX50T (29,000 LUTs), LX110T (69,000 LUTs), LX155T (97,000 LUTs).
- Communication architectures compared: general (400-1 mux) and custom (for example 24-1, 9-1, 24-1, and 16-1 muxes).
- Designs: 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF (12 mux), 2 CF, 4 CF, 8 CF, 16 CF.
- Figure: bar chart of design size (number of LUTs, 0 to 90,000) per design, split into Comms and Static.

17/21 Components' timing info (Xilinx Virtex5 110T FPGA)
Component           Clock     Cycles                                       Frame/sec
Image scaler        130 MHz   6 cycles/pixel                               124
Buffer controller   65 MHz    11 cycles/window                             110
Classifier          65 MHz    (3 + examined features/#CF) cycles/window    0.6 (min) to 201 (max)
Performance upper bound: 110 fps.
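
The classifier's cycle formula can be explored with a short Python sketch. The formula and the 11 cycles/window figure are from the slide; pairing the formula with the 65 MHz clock follows the slide's listing, and the example feature counts (2 for a window rejected in the first stage, 2135 for a window that passes every face stage) are taken from other slides and combined here as an assumption.

```python
CLOCK_HZ = 65_000_000  # assumed to be the classifier clock


def classifier_cycles_per_window(examined_features, num_cf):
    """Slide formula: 3 + examined features / #CF."""
    return 3 + examined_features / num_cf


def windows_per_second(cycles_per_window, clock_hz=CLOCK_HZ):
    return clock_hz / cycles_per_window


for num_cf in (1, 4, 16):
    early = classifier_cycles_per_window(2, num_cf)     # rejected in stage 1
    full = classifier_cycles_per_window(2135, num_cf)   # passes all face stages
    print(num_cf, early, full)
```

Since the number of examined features depends on how early a sub-window is rejected, classifier throughput varies with image content, which is consistent with the wide min-to-max range on the slide. Once the classifier needs fewer than 11 cycles per window, the buffer controller becomes the limit.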

18/21 Performance comparison
- FPGA implementations run at 0.6X to 25X the speed of desktop C (Pentium 4 3.0 GHz).
- The upper bound on FPGA performance is determined by the buffer controller.
- Figure: bar chart of performance (frame/sec., 0 to 120) for Desktop, 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, and 16 CF on the Face (complex), Face (simple), and Eye inputs.

19/21 Comparison to previous work
Compared to Cho's [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.

Design           Size (LUTs)   Performance (fps)
Cho's (1 CF)     64,143        17.5
Ours (1 CF)      45,713        19.3
Cho's (3 CFs)    84,232        28.8
Ours (16 CFs)    77,059        90.9

- More scalable due to custom design space exploration
- 3x faster with 8% fewer LUTs (ours with 16 CFs against Cho's with 3 CFs)
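
The headline ratios can be checked directly from the table's figures. The short Python sketch below uses only the numbers on this slide; the variable names are illustrative.

```python
cho_3cf = {"luts": 84232, "fps": 28.8}
ours_16cf = {"luts": 77059, "fps": 90.9}

speedup = ours_16cf["fps"] / cho_3cf["fps"]
lut_saving = 1 - ours_16cf["luts"] / cho_3cf["luts"]

print(f"{speedup:.2f}x faster, {lut_saving:.1%} fewer LUTs")
# 3.16x faster, 8.5% fewer LUTs
```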

20/21 Video Demo
http://www.youtube.com/watch?v=gkQVanU5P5U

21/21 Conclusions
- Effectively implemented an object detection algorithm on a modern series of FPGAs
- Custom design space exploration is necessary for complex applications
- Future work: implement more applications using custom search/optimization
Thank you!

