Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration. Chen Huang and Frank Vahid, Dept. of Computer Science and Engineering.

Presentation transcript:

1/21 Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
Chen Huang and Frank Vahid
Dept. of Computer Science and Engineering, University of California, Riverside, USA
This work was supported in part by NSF CNS

2/21 Outline
- Haar-feature based object detection algorithm
- Custom design space exploration: feature mapping problem
- Experimental results

3/21 Haar-feature based object detection algorithm
A 20x20 sub-window moves along the X and Y axes of the original image: (320 - 20) * (240 - 20) = 66,000 sub-windows.
The same scan is repeated on scaled images, so faces are detected at different scales.
[Figure: original image, scaled images, movement of the 20x20 sub-window, and a found face.]
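The sub-window count on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `count_subwindows` and the `inclusive` flag are illustrative and not from the presentation. It assumes a stride of one pixel; the slide's figure drops the final row and column of positions, so the exact count is slightly higher.

```python
def count_subwindows(img_w, img_h, win=20, inclusive=False):
    """Number of top-left positions of a win x win window moved one pixel at a time."""
    extra = 1 if inclusive else 0
    return (img_w - win + extra) * (img_h - win + extra)

# The slide's estimate: (320 - 20) * (240 - 20)
slide_estimate = count_subwindows(320, 240)
# Counting the last row and column of positions as well: 301 * 221
exact_count = count_subwindows(320, 240, inclusive=True)
print(slide_estimate, exact_count)
```

Either way, one 320x240 frame yields tens of thousands of sub-windows before any scaled images are considered, which is why the per-window cost dominates.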

4/21 Face detection in sub-window
Each 20x20 sub-window is tested against facial Haar features and either passes or fails.
Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)
Constant-time Pixel_Sum calculation using the integral image: Pixel_Sum(R1) = P4 - P2 - P3 + P1
Each integral-image entry stores the pixel sum of the rectangle from the top-left corner of the image to that point, so a rectangle sum needs only the 4 corner values P1, P2, P3, P4.
[Figure: original image and integral image, with rectangle R1 and its corner points.]
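The integral-image trick can be sketched in software as follows. This is an illustrative Python version, not the FPGA design: the function names are invented here, and the integral image is padded with an extra zero row and column (an assumption made to avoid edge cases) so that every rectangle has four valid corners.

```python
def integral_image(img):
    """Return ii with a zero border: ii[y][x] = sum of img rows 0..y-1, columns 0..x-1."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def pixel_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y), using 4 corner reads."""
    p1 = ii[y][x]          # top-left corner
    p2 = ii[y][x + w]      # top-right corner
    p3 = ii[y + h][x]      # bottom-left corner
    p4 = ii[y + h][x + w]  # bottom-right corner
    return p4 - p2 - p3 + p1

def haar_value(ii, white, black):
    """Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)."""
    return pixel_sum(ii, *white) - pixel_sum(ii, *black)

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
ii = integral_image(img)
print(pixel_sum(ii, 1, 1, 2, 2))                     # centre 2x2 block
print(haar_value(ii, (0, 0, 2, 4), (2, 0, 2, 4)))    # left half minus right half
```

The cost of `pixel_sum` does not depend on the rectangle size, which is what makes the per-feature work constant.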

5/21 Cascade decision process
The frontal-face classifier has 2000 features, divided into multiple stages: S1 has 2 features, S2 has 5 features, S3 has 16 features, and so on.
A sub-window that passes every stage is reported as a detected face.
Failing any stage rejects the current sub-window.
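A minimal sketch of the cascade decision, assuming a common Viola-Jones style convention that the slides do not spell out: each feature contributes a "left" value when its Haar value is below the feature threshold and a "right" value otherwise, and a stage passes when the accumulated sum reaches the stage threshold. All names and numbers below are hypothetical.

```python
def run_cascade(stages, feature_value):
    """stages: list of (features, stage_threshold); a feature is (fid, thr, left, right).
    Returns (passed, number_of_features_examined)."""
    examined = 0
    for features, stage_threshold in stages:
        stage_sum = 0.0
        for fid, thr, left, right in features:
            examined += 1
            stage_sum += left if feature_value(fid) < thr else right
        if stage_sum < stage_threshold:
            return False, examined      # failing any stage rejects the sub-window
    return True, examined               # every stage passed

# Toy cascade: stage 1 has 2 features, stage 2 has 5, as on the slide.
stage1 = ([(0, 5, 0.0, 1.0), (1, 5, 0.0, 1.0)], 2.0)
stage2 = ([(i, 5, 0.0, 1.0) for i in range(2, 7)], 3.0)
stages = [stage1, stage2]

face_like = {i: 10 for i in range(7)}    # hypothetical Haar values
background = {i: 0 for i in range(7)}

print(run_cascade(stages, lambda fid: face_like[fid]))
print(run_cascade(stages, lambda fid: background[fid]))
```

The second call stops after the two features of stage 1, which illustrates why most sub-windows examine only a small fraction of the full feature set.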

6/21 Algorithm FPGA implementation
[Figure: block diagram of the FPGA design. Blocks: video in, frame grabber, image scaler, buffer controller, integral image (20x20 sub-window), classifier (Haar feature calculation/decision), rectangle drawer, video out (objects in rectangles).]

7/21 Integral image and Classifier
[Figure: the FPGA block diagram with the integral image and classifier expanded. The integral image buffer (20 x bit register file) delivers corner values a1-a4, b1-b4, c1-c4 to the classifier. The classifier forms rectangle sums, multiplies them by constants (x2, x3), adds them into a feature value, compares that value with the feature threshold to select a left or right value through a mux, and accumulates the selected value into the feature sum.]

8/21 Communication bottleneck
General communication architecture: every classifier port reads the 20x20 integral image through a 400-to-1 mux.
- One 400-to-1 17-bit mux: 2300 LUTs
- 12 muxes: 27,600 LUTs, 40% of the Virtex5 110T (69,120 LUTs)
Drawbacks:
- Does not scale well for multiple classifiers
- Wire congestion problem
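The arithmetic behind the bottleneck can be checked directly. The constants below come from the slide; the assumption that mux cost grows linearly with the number of classifiers (12 ports each) is an extrapolation made here for illustration, and the function name is hypothetical.

```python
LUTS_PER_400_TO_1_MUX = 2300   # from the slide: one 400-to-1, 17-bit mux
PORTS_PER_CLASSIFIER = 12      # one mux per classifier port
VIRTEX5_110T_LUTS = 69_120

def general_comm_luts(num_classifiers):
    """LUTs spent on muxes alone, assuming linear scaling with classifier count."""
    return num_classifiers * PORTS_PER_CLASSIFIER * LUTS_PER_400_TO_1_MUX

one_cf = general_comm_luts(1)
share = one_cf / VIRTEX5_110T_LUTS
print(one_cf, f"{share:.1%}")
for n in (2, 4):
    print(n, general_comm_luts(n), general_comm_luts(n) <= VIRTEX5_110T_LUTS)
```

Under that assumption, four classifiers would need more mux LUTs than the whole device provides, which motivates the custom communication architecture.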

9/21 Custom communication architecture for multi-classifier
[Figure: integral image feeding multiple classifiers CF1-CF4 through a mux; chart of feature number against classifier number for CF1-CF4.]

10/21 Custom communication architecture for multi-classifier
[Figure: integral image feeding classifiers CF1-CF4. In the custom communication architecture each classifier port (for example CF1_port1, CF2_port9, CF4_port2) has its own smaller mux, such as 9-1, 16-1, or 24-1; chart of feature number against classifier number.]

11/21 Feature mapping problem
Mapping 26 features into 4 classifiers.
[Figure: cascade of stages 1 to n (pass leads to the next stage and finally to "object found"; fail rejects), and a table assigning the features of each stage to classifiers CF1-CF4.]

12/21 Feature mapping problem
Mapping 26 features into 4 classifiers: the number of possible mappings grows exponentially with the number of features.
Simulated annealing, with two neighbor moves: swap and migrate.
Total stage delay measures performance; total wire number measures size.
Objective: Min (Total stage delay * Total wire number)
1 million iterations (30 min)
[Figure: table assigning the features of each stage to classifiers CF1-CF4.]
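The search on this slide can be sketched as a small simulated-annealing loop. Only the objective (total stage delay times total wire number) and the two moves (swap, migrate) come from the slide. Everything else is an assumption made for illustration: stage delay is modeled as the largest number of a stage's features placed on one classifier, wire number as the count of distinct integral-image entries each classifier needs, and the cooling schedule, stage sizes, and random entry sets are hypothetical.

```python
import math
import random

def cost(mapping, stage_of, entries_of, num_cf):
    """Objective from the slide: total stage delay * total wire number."""
    load = {}                                  # (stage, classifier) -> feature count
    wires = [set() for _ in range(num_cf)]     # distinct entries per classifier
    for f, cf in enumerate(mapping):
        key = (stage_of[f], cf)
        load[key] = load.get(key, 0) + 1
        wires[cf] |= entries_of[f]
    stage_delay = {}
    for (stage, _), n in load.items():
        stage_delay[stage] = max(stage_delay.get(stage, 0), n)
    return sum(stage_delay.values()) * sum(len(w) for w in wires)

def neighbor(mapping, num_cf, rng):
    new = list(mapping)
    i = rng.randrange(len(new))
    if rng.random() < 0.5:                     # swap two features' classifiers
        j = rng.randrange(len(new))
        new[i], new[j] = new[j], new[i]
    else:                                      # migrate one feature
        new[i] = rng.randrange(num_cf)
    return new

def anneal(stage_of, entries_of, num_cf, iters=5000, t0=50.0, seed=0):
    rng = random.Random(seed)
    cur = [f % num_cf for f in range(len(stage_of))]   # round-robin start
    cur_cost = cost(cur, stage_of, entries_of, num_cf)
    best, best_cost = cur, cur_cost
    for k in range(iters):
        temp = t0 * (1 - k / iters) + 1e-9             # linear cooling (assumed)
        cand = neighbor(cur, num_cf, rng)
        c = cost(cand, stage_of, entries_of, num_cf)
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

# Hypothetical instance: 26 features in stages of 2, 5, 16 and 3; 4 classifiers.
rng = random.Random(1)
stage_of = [0] * 2 + [1] * 5 + [2] * 16 + [3] * 3
entries_of = [frozenset(rng.sample(range(400), 8)) for _ in stage_of]
initial = [f % 4 for f in range(len(stage_of))]
initial_cost = cost(initial, stage_of, entries_of, 4)
best, best_cost = anneal(stage_of, entries_of, 4)
print(initial_cost, best_cost)
```

The product objective trades the two terms against each other: spreading a stage's features evenly lowers delay, while clustering features that share integral-image entries lowers wire count.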

13/21 Automatic VHDL code generation
Scheduling: integral image -> mux -> classifier 1, with a BRAM driving the mux select.
Feature mapping: 1, 4, 66, 3 (needs entry: 5, 24, 46, 92)
Structural RTL code for communication components:
Mux1: mux4 port map(II(5), II(24), II(46), II(92), select, dout);
C1: classifier port map(dout, …);
Bram1: bram generic map(2, 1, 4, 3, …) port map(…., select);
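How such structural code might be emitted from a feature mapping can be sketched with a small generator. This is a hypothetical illustration, not the authors' tool: the function name and its parameters are invented here, and it only reproduces the mux line shown on the slide. The signal name `select` is copied from the slide; since `select` is a reserved word in VHDL, a real generator would presumably use a different name.

```python
def gen_mux_instance(name, entries, select="select", dout="dout"):
    """Emit one structural mux instantiation wired to the listed integral-image entries."""
    ports = ", ".join(f"II({e})" for e in entries)
    return f"{name}: mux{len(entries)} port map({ports}, {select}, {dout});"

# Entries needed by the features mapped to this classifier port (from the slide).
line = gen_mux_instance("Mux1", [5, 24, 46, 92])
print(line)
```

The mux width follows from the number of entries, so a port that only ever needs four entries gets a 4-to-1 mux instead of a 400-to-1 mux.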

14/21 Review of custom design space exploration
Object detection application -> custom design space exploration (program analysis, design exploration, design generation) -> map to different FPGAs.
Inputs: resource constraints, performance requirements.
Issues addressed: communication bottleneck (mux), feature mapping problem, different number of classifiers.
Result: Pareto design points of execution time against size.
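Selecting Pareto design points from a set of candidate designs can be sketched as below. The design names and their (size, execution time) values are invented purely for illustration; lower is better on both axes.

```python
def pareto_front(points):
    """Keep designs not dominated in (size, time); a point is dominated when another
    is no worse on both axes and strictly better on at least one."""
    front = []
    for name, size, time in points:
        dominated = any(
            s <= size and t <= time and (s < size or t < time)
            for _, s, t in points
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates: (name, size, execution time)
designs = [("A", 10, 100), ("B", 20, 60), ("C", 25, 70), ("D", 40, 30)]
print(pareto_front(designs))
```

Design C is dropped because B is both smaller and faster; the remaining points are the ones worth mapping to a given FPGA under its resource constraints.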

15/21 Experiment scenarios
- Different implementations
  Desktop: Pentium4 3.0 GHz, fixed-point C
  FPGA: 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, 16 CF on Xilinx Virtex5 LX50T, LX110T, and LX155T
- Feature sets
  Face: 2135 features
  Eye: 1066 features
- Sample images: Face (simple), Face (complex), Eye
Each classifier has 12 ports.

16/21 Experiment: FPGA resource utilization
Map to different Xilinx Virtex5 FPGAs: LX50T (29,000 LUTs), LX110T (69,000 LUTs), LX155T (97,000 LUTs).
[Figure: design size (number of LUTs), split into comms and static parts, for 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF (12 mux), 2 CF, 4 CF, 8 CF, and 16 CF, comparing the general and custom communication architectures; the custom architecture uses muxes such as 9-1, 16-1, and 24-1.]

17/21 Components' timing info
Performance of different components on the Xilinx Virtex5 110T FPGA:
- Image scaler: 130 MHz, 6 cycles/pixel
- Buffer controller: 65 MHz, 11 cycles/window
- Classifier: 65 MHz, (3 + examined features/#CF) cycles/window
Performance upper bound: 110 fps
[Figure: minimum and maximum frames/sec of each component, shown against the block diagram.]
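A rough way to see when adding classifiers stops helping, as a sketch under stated assumptions: the timing figures are read as buffer controller = 11 cycles/window and classifier = (3 + examined features/#CF) cycles/window, both at 65 MHz, and the two are assumed to run as a pipeline so that the slower one sets the window rate. The function names and the example of 32 examined features are hypothetical.

```python
CLOCK_HZ = 65_000_000
BUFFER_CYCLES_PER_WINDOW = 11

def classifier_cycles_per_window(examined_features, num_cf):
    """Classifier cost from the slide: 3 + examined features / #CF."""
    return 3 + examined_features / num_cf

def bottleneck(examined_features, num_cf):
    cf_cycles = classifier_cycles_per_window(examined_features, num_cf)
    return "classifier" if cf_cycles > BUFFER_CYCLES_PER_WINDOW else "buffer controller"

def windows_per_second(examined_features, num_cf):
    slowest = max(BUFFER_CYCLES_PER_WINDOW,
                  classifier_cycles_per_window(examined_features, num_cf))
    return CLOCK_HZ / slowest

for n in (1, 4, 16):
    print(n, classifier_cycles_per_window(32, n), bottleneck(32, n),
          round(windows_per_second(32, n)))
```

In this model, once examined features/#CF drops to 8 or below, the buffer controller's fixed 11 cycles/window becomes the limit, which is consistent with the upper bound being attributed to the buffer controller.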

18/21 Performance comparison
FPGA implementations are 0.6 to 25X faster than desktop C.
[Figure: performance (frames/sec) on Face (simple), Face (complex), and Eye for the desktop Pentium and for 1 CF (1 mux), 1 CF (3 mux), 1 CF (6 mux), 1 CF, 2 CF, 4 CF, 8 CF, and 16 CF, with the upper bound (determined by buffer controller) marked.]

19/21 Comparison to previous work
Compared to Cho's [FPGA 09] implementation of the same algorithm with 320x240 pixels on the same FPGA.
Size (LUTs) and performance (fps):
Cho's (1 CF): 64,
Ours (1 CF): 45,
Cho's (3 CFs): 84,
Ours (16 CFs): 77,
More scalable due to custom design space exploration: 3x faster with 8% fewer LUTs.

20/21 Video Demo

21/21 Conclusions
- Effectively implemented object detection algorithm on a modern series of FPGAs
- Custom design space exploration is necessary for complex applications
- Future work: implement more applications using custom search/optimization
Thank you!