1/21 Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
This work was supported in part by NSF CNS
2/21 Outline
Haar-feature based object detection algorithm
Custom design space exploration: feature mapping problem
Experimental results
3/21 Haar-feature based object detection algorithm
A 20x20 sub-window moves across the original image along the X axis and Y axis: (320 – 20) * (240 – 20) = 66,000 sub-windows.
The same scan is repeated on scaled images, so faces are detected on different scales.
(Figure: original image, scaled images, movement of the sub-window, face found.)
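The sub-window count on this slide can be reproduced with a short sketch. The function names and the scale factor of 1.25 are assumptions made for illustration; the slides give only the 20x20 window size and the 320x240 count.

```python
def count_subwindows(width, height, window=20):
    """Window positions, using the slide's (W - window) * (H - window) count."""
    return (width - window) * (height - window)


def scaled_sizes(width, height, factor=1.25, window=20):
    """Yield image sizes shrunk by `factor` while the window still fits.

    The factor is an assumed value; the slides do not state one.
    """
    w, h = width, height
    while w > window and h > window:
        yield w, h
        w, h = int(w / factor), int(h / factor)


print(count_subwindows(320, 240))  # 66000

# Total work per frame grows with every extra scale that is scanned.
total = sum(count_subwindows(w, h) for w, h in scaled_sizes(320, 240))
print(total)
```

The total across scales depends entirely on the assumed scale factor, so it is printed without an expected value.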
4/21 Face detection in sub-window
Each 20 x 20 sub-window is tested against facial Haar features and either passes or fails.
Calculate Haar-feature value: Pixel_Sum(Rect_W) – Pixel_Sum(Rect_B)
Constant time Pixel_Sum calculation: each integral image entry stores the pixel sum of the rectangle from the top-left corner to that point, so a rectangle sum needs only 4 corner values.
Pixel_Sum(R1) = P4 - P2 - P3 + P1
(Figure: original image and integral image, with corners P1, P2, P3, P4 of rectangle R1.)
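A minimal software sketch of the integral image and the four-corner rectangle sum follows. The function names and the (x, y, w, h) rectangle layout are choices made here, not taken from the slides; the padding row and column of zeros is a common convention that keeps the corner lookups in range.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows < y and columns < x (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def pixel_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y)."""
    p1 = ii[y][x]          # top-left corner
    p2 = ii[y][x + w]      # top-right corner
    p3 = ii[y + h][x]      # bottom-left corner
    p4 = ii[y + h][x + w]  # bottom-right corner
    return p4 - p2 - p3 + p1


def haar_feature_value(ii, rect_w, rect_b):
    """Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B); rectangles are (x, y, w, h)."""
    return pixel_sum(ii, *rect_w) - pixel_sum(ii, *rect_b)


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(pixel_sum(ii, 1, 1, 2, 2))                          # 28
print(haar_feature_value(ii, (0, 0, 3, 1), (0, 2, 3, 1)))  # -18
```

Every rectangle sum costs four reads and three additions regardless of rectangle size, which is what makes the per-feature cost constant.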
5/21 Cascade decision process
Frontal-face has 2000 features, divided into multiple stages: S1 2 features, S2 5 features, S3 16 features, ……
A sub-window that passes every stage is a detected face. Failing any stage rejects the current sub-window.
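The early-reject behaviour of the cascade can be sketched as below. The data layout (a per-feature threshold with left and right values, and a per-stage threshold on the accumulated sum) and the direction of both comparisons are assumptions in the style of common Haar cascade implementations; the slides do not spell them out.

```python
def run_cascade(stages, feature_value):
    """Return True if a sub-window passes every stage.

    stages: list of (features, stage_threshold)
    features: list of (feature_id, threshold, left_value, right_value)
    feature_value: callable mapping feature_id to its Haar-feature value
    """
    for features, stage_threshold in stages:
        stage_sum = 0.0
        for feat_id, threshold, left, right in features:
            # Assumed rule: pick the left value when the feature value is
            # below its threshold, otherwise the right value.
            stage_sum += left if feature_value(feat_id) < threshold else right
        if stage_sum < stage_threshold:
            return False  # failing any stage rejects the sub-window
    return True


# Hypothetical two-stage cascade with made-up thresholds.
stages = [
    ([(0, 2.0, 1.0, 0.0), (1, 2.0, 1.0, 0.0)], 1.0),
    ([(2, 1.0, 1.0, 0.0)], 1.0),
]
face_like = {0: 1.0, 1: 5.0, 2: 0.5}
background = {0: 3.0, 1: 5.0, 2: 0.5}
print(run_cascade(stages, face_like.get))   # True
print(run_cascade(stages, background.get))  # False
```

Because most sub-windows are rejected in the first few stages, the number of features actually examined per window is usually far below the full feature count.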
6/21 Algorithm FPGA implementation
Blocks on the FPGA: frame grabber (video in), image scaler, buffer controller, integral image (20 x 20 sub-window), classifier (Haar feature calculation/decision), rectangle drawer (video out, objects in rectangles).
7/21 Integral image and Classifier
Integral image buffer (20 x bit register file): data delivery of corner values a1 a2 a3 a4, b1 b2 b3 b4, c1 c2 c3 c4 to the classifier.
Classifier datapath labels: rect sum; multiply by constant (x2, x3); + giving the feature value; feature threshold > feature value; mux selecting left value or right value; + (feature sum).
8/21 Communication bottleneck
General communication architecture: each classifier port reaches the 20 x 20 integral image through a 400-to-1 mux.
One 400-to-1 17-bit mux: 2300 LUTs. 12 muxes: 27,600 LUTs, 40% of the Virtex5 110T (69,120 LUTs).
Drawbacks: does not scale well for multiple classifiers; wire congestion problem.
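The LUT arithmetic on this slide can be checked with a few lines. The two constants come from the slide; the linear scaling with the number of classifiers is an assumption used only to illustrate why the general architecture does not scale.

```python
LUTS_PER_400_TO_1_MUX = 2300  # 17-bit wide 400-to-1 mux, from the slide
DEVICE_LUTS = 69120           # Virtex5 110T, from the slide


def general_comm_luts(ports_per_classifier=12, classifiers=1):
    """LUTs spent on muxes if every port gets a full 400-to-1 mux.

    Assumes cost grows linearly with the number of classifiers.
    """
    return LUTS_PER_400_TO_1_MUX * ports_per_classifier * classifiers


for n in (1, 2, 4):
    luts = general_comm_luts(classifiers=n)
    print(n, luts, f"{100 * luts / DEVICE_LUTS:.0f}%")
# 1 27600 40%
# 2 55200 80%
# 4 110400 160%
```

Under that assumption, four classifiers would already need more mux LUTs than the whole device provides.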
9/21 Multiple classifiers
Custom communication architecture for multi-classifier: classifiers CF1, CF2, CF3, CF4 read the integral image through muxes.
(Figure: table of classifier number against feature number for CF1 to CF4.)
10/21 Multiple classifiers
Custom communication architecture for multi-classifier: each classifier port gets its own, smaller mux into the integral image.
Port labels in the figure: CF1_port1, CF2_port9, CF3_port, CF4_port2. Mux sizes shown: 9-1 mux, 24-1 mux, 16-1 mux.
(Figure: table of classifier number against feature number for CF1 to CF4.)
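One way to see where the smaller mux sizes come from: a port only needs to reach the integral-image entries used by the features mapped to its classifier. The sketch below assumes that reading of the figure; the function name, the data layout, and the entry numbers are all illustrative.

```python
def port_mux_inputs(mapped_features, entries_of):
    """Per port, the sorted distinct integral-image entries it must reach.

    mapped_features: feature ids assigned to one classifier
    entries_of[feature]: the entry that feature reads on each port, in order
    """
    n_ports = len(entries_of[mapped_features[0]])
    ports = [set() for _ in range(n_ports)]
    for feat in mapped_features:
        for port, entry in enumerate(entries_of[feat]):
            ports[port].add(entry)
    return [sorted(p) for p in ports]


# Hypothetical data: 4 features on one classifier, 2 ports shown instead of 12.
entries_of = {1: [5, 30], 4: [24, 30], 66: [46, 51], 3: [92, 51]}
inputs = port_mux_inputs([1, 4, 66, 3], entries_of)
print([len(p) for p in inputs])  # [4, 2]
```

Here the first port needs a 4-to-1 mux and the second a 2-to-1 mux, instead of two 400-to-1 muxes. Features that share entries on a port shrink its mux, which is why the choice of mapping matters.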
11/21 Feature mapping problem
Mapping 26 features into 4 classifiers (CF1, CF2, CF3, CF4).
Cascade: Stage 1, Stage 2, …, Stage n. Passing every stage means object found; failing a stage rejects.
(Figure: table of stages and features assigned to CF1, CF2, CF3, CF4.)
12/21 Feature mapping problem
Mapping 26 features into 4 classifiers (CF1, CF2, CF3, CF4).
The number of possible mappings grows exponentially with the number of features.
Simulated annealing, with neighbor moves: swap, migrate.
Total stage delay reflects performance; total wire number reflects size.
Objective: Min (Total stage delay * Total wire number)
1 million iterations (30 min)
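A small simulated-annealing sketch of this search is given below. The slides name the objective and the two moves but not how the terms are computed, so the definitions here are assumptions: a stage's delay is taken as the largest number of its features on any one classifier, and the wire number as the total count of distinct integral-image entries each classifier must reach. The cooling schedule, starting temperature, and all names are likewise illustrative.

```python
import math
import random


def cost(mapping, stage_of, entries_of, n_cf):
    """Total stage delay * total wire number, under the assumed definitions."""
    load = {}
    wires = [set() for _ in range(n_cf)]
    for feat, cf in mapping.items():
        key = (stage_of[feat], cf)
        load[key] = load.get(key, 0) + 1
        wires[cf].update(entries_of[feat])
    stage_delay = {}
    for (stage, _cf), n in load.items():
        stage_delay[stage] = max(stage_delay.get(stage, 0), n)
    return sum(stage_delay.values()) * sum(len(w) for w in wires)


def neighbor(mapping, n_cf, rng):
    """Swap the classifiers of two features, or migrate one feature."""
    new = dict(mapping)
    feats = list(new)
    if rng.random() < 0.5:
        a, b = rng.sample(feats, 2)          # swap
        new[a], new[b] = new[b], new[a]
    else:
        new[rng.choice(feats)] = rng.randrange(n_cf)  # migrate
    return new


def anneal(stage_of, entries_of, n_cf, iterations=20000, t0=10.0, seed=0):
    rng = random.Random(seed)
    current = {f: rng.randrange(n_cf) for f in stage_of}
    cur_cost = cost(current, stage_of, entries_of, n_cf)
    best, best_cost = current, cur_cost
    for i in range(iterations):
        temp = t0 * (1 - i / iterations) + 1e-9  # linear cooling (assumed)
        cand = neighbor(current, n_cf, rng)
        c = cost(cand, stage_of, entries_of, n_cf)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost
```

A call such as `anneal(stage_of, entries_of, n_cf=4)` takes a dict from feature id to stage and a dict from feature id to the entries it reads, and returns the best mapping found with its cost. The product objective penalizes mappings that are fast but wire-hungry as well as ones that are compact but slow.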
13/21 Automatic VHDL code generation
Feature mapping for Classifier 1: 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)
Scheduling: a BRAM drives the select of the mux between the integral image (II) and Classifier 1; the mux output is dout.
Structural RTL code for communication components:
Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
C1: classifier port map(dout, …);
Bram1: bram generic map(2, 1, 4, 3, …) port map(…., select);
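Generating such a structural line from a mapping is mostly string formatting. The sketch below emits the mux instantiation shown on this slide; the helper name and its parameters are hypothetical, and the component naming scheme (`mux` followed by the input count) is inferred from the single `mux4` example.

```python
def mux_instance(name, entries, select="select", out="dout"):
    """Emit a structural VHDL mux instantiation over integral-image entries."""
    inputs = ", ".join(f"II({e})" for e in entries)
    return f"{name}: mux{len(entries)} port map({inputs}, {select}, {out});"


print(mux_instance("Mux1", [5, 24, 46, 92]))
# Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
```

The classifier and BRAM instances are left out because the slide elides most of their port and generic lists.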
14/21 Review of custom design space exploration
Input: object detection application, with resource constraints and performance requirements.
Custom design space exploration: program analysis, design exploration, design generation.
Output: Pareto design points (execution time against size) with different numbers of classifiers, mapped to different FPGAs.
Issues handled: communication bottleneck (mux), feature mapping problem.
15/21 Experiment scenarios
Different implementations:
Desktop: Pentium4 3.0 GHz, fixed-point C
FPGA: 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex5 LX50T, LX110T, and LX155T. Each classifier has 12 ports.
Feature sets: Face: 2135 features; Eye: 1066 features
Sample images: Face (simple), Face (complex), Eye
16/21 Experiment: FPGA resource utilization
Map to different Xilinx Virtex5 FPGAs: LX50T (29,000), LX110T (69,000), LX155T (97,000).
(Chart: design size in number of LUTs, split into comms and static, for 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF (12 mux), 2 CF, 4 CF, 8 CF, 16 CF, comparing the general comm. architecture with the custom comm. architecture. Mux sizes labeled in the chart: 24-1 mux, 9-1 mux, 24-1 mux, 16-1 mux.)
17/21 Performance of different components
Components' timing info (image scaler, buffer controller, classifier), as listed: 65 MHz, 11 cycles/window; 65 MHz, (3 + examined features/#CF) cycles/window; 130 MHz, 6 cycles/pixel.
(Chart: frames/sec, min and max, for each component on the Xilinx Virtex5 110T FPGA. Performance upper bound: 110 fps.)
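The classifier timing formula can be turned into a rough throughput model. Only the expression (3 + examined features/#CF) comes from the slide; the reading of "examined features" as the average number of features evaluated per window before rejection, the windows-per-frame figure, and the pairing of the 65 MHz clock with the classifier are assumptions for illustration.

```python
def classifier_cycles_per_window(examined_features, n_cf):
    """Cycles per window from the slide: 3 + examined features / #CF."""
    return 3 + examined_features / n_cf


def frames_per_second(clock_hz, cycles_per_window, windows_per_frame):
    """Frame rate of one component processing every window in a frame."""
    return clock_hz / (cycles_per_window * windows_per_frame)


def pipeline_fps(component_fps):
    """The slowest component bounds the whole pipeline."""
    return min(component_fps)


# Hypothetical workload: 32 examined features per window on average,
# 200,000 windows per frame across all scales, 65 MHz classifier clock.
cycles = classifier_cycles_per_window(32, n_cf=16)
clf_fps = frames_per_second(65e6, cycles, 200_000)
print(cycles, clf_fps)                 # 5.0 65.0
print(pipeline_fps([clf_fps, 110.0]))  # 65.0
```

As the number of classifiers grows, the fixed 3-cycle term dominates and another component becomes the bound, which matches the idea of a performance upper bound set outside the classifier.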
18/21 Performance comparison
FPGA implementations are 0.6 to 25X faster than desktop C.
(Chart: performance in frames/sec for Face (complex), Face (simple), and Eye on Desktop (Pentium GHz), 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF. The upper bound is determined by the buffer controller.)
19/21 Comparison to previous work
Compared to Cho's [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.
Size (LUTs) and performance (fps): Cho's (1 CF) 64, Ours (1 CF) 45, Cho's (3 CFs) 84, Ours (16 CFs) 77,
More scalable due to custom design space exploration: 3x faster with 8% fewer LUTs.
20/21 Video Demo
21/21 Conclusions
Effectively implemented an object detection algorithm on a modern series of FPGAs.
Custom design space exploration is necessary for complex applications.
Future work: implement more applications using custom search/optimization.
Thank you!