An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems (ISCA 2005, June 6; NEC Corporation)

Slide 1/30: An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems
ISCA 2005, June 6, NEC Corporation
Shorin KYO (*1), Shin'ichiro OKAZAKI (*1), Tamio ARAI (*2)
*1 Media and Information Research Laboratories, NEC Corporation
*2 School of Engineering, University of Tokyo

Slide 2/30: Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluations
5. Summary

Slide 3/30: Three Basic Requirements
Ex. Embedded Driver Assistant Systems
1) High Performance
2) Cost/Power Efficiency: low cost, easy cooling (< 2 Watt), high quality / reliability, low EMI
3) High Flexibility (Scalability and Versatility): able to handle the combination of [applications × situations × targets]
[Figure: GOPS scale (1, 10, 100, 1000) and robustness for lane marks, preceding obstacles, side/back obstacles, traffic signs and pedestrians; label: Realtime Response]

Slide 4/30: Applications × Situations × Targets
Dynamic Back Up Aid, Cross Traffic Warning, Following Distance Warning, Park Slot Measurement, Backup Parking Assist, Stop&Go, Side Pre-Crash, Cut-In, Front Pre-Crash, Lane Change Assist, Pedestrian Protection, Blind Spot Detection, Drowsiness Warning, Traffic Sign Recognition

Slide 5/30: COR: Control versus Operational circuit Ratio
Trading-off items:
1) Performance (higher)
2) Cost (lower)
3) Flexibility (higher)
Cost means die size / power consumption. The share of control circuitry relates to flexibility; the share of operational circuitry relates to (peak) performance.
Processor classes compared:
a) Desktop/Server CPU (GPPs)
b) MIMDs (Multi-Cores)
c) DSPs
d) Highly parallel SIMDs
e) Special purpose LSI
[Figure: % of control circuitry versus % of operational circuitry (scale to 100) for Itanium, Sparc64, SPE (CELL), FR1000, FR500, IMAP-CE, IMAPCAR and CODEC LSI]

Slide 6/30: Overcoming the Flexibility Gap
Processor classes on the flexibility versus performance plot:
(a) GPPs
(b) DSPs and MIMDs
(c) Highly parallel SIMDs
(d) Custom logics + DSP core
(e) Custom logics only
[Figure: share of control circuits versus operation circuits for each class; fixed cost and technology constraint (a technology barrier); flexibility gap]
Challenge of embedded image processors ⇒ minimizing COR while overcoming the "Flexibility Gap".

Slide 7/30: Outline
1. Challenge of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary

Slide 8/30: IMAP Series Processors
[Figure: peak performance (GOPS, log scale 0.1 to 1000) versus year (1990 to 2010) for IMAP-1, IMAP-VISION, IMAP-2, IMAP-CE and IMAPCAR]
IMAP-2: 40MHz, 64PE/chip
IMAP-CE: 100MHz, 128PE/chip, 4-way VLIW, 50 GOPS, 0.18um, 2 to 4 Watt
IMAPCAR: 100MHz, 128PE/chip, 4-way VLIW+MAC, 100 GOPS, -40℃ to 85℃, 0.13um, < 2 Watt
Other figure labels: 15MHz, 8PE/chip; 40MHz, 32PE/chip; ISSCC'95; CAMP'97; ISSCC'03
[Die photo: IMAP-CE (32.7M Tr, 0.18um), 11.0mm; blocks PE8 (eight PEs integration block), CP, EXTIF, DPLL]

Slide 9/30: Block Diagram and Features
1) 100MHz, 128 4-way VLIW linear array PEs
2) Two level memory architecture + user DMA
3) Automated mapping of image data to each PE
4) 128 individual RAM blocks configuration
5) 1DC (One Dimensional C) + "Line methods"
6) Enhanced PE instruction set design for 1DC
[Block diagram labels: Video IN, Video OUT, SR0 to SR3, Host Processor, Control Processor (CP; 4-way VLIW; P$, D$, STK RAM), PE 0 to 127, IMEM (2KB), External Mem. I/F, EMEM (SDRAM/SSRAM, 64MB and up), 12.8 GByte/s, 0.8 GByte/s, CP instruction broadcast (SIMD)]
[PE labels: 24 x 8b general purpose registers; ALU x1, MUL x1, LOG x1, LSU x1; ADD, MUL, LOG, RDU, LSU, COMM; to/from other PEs, to/from IMEM, to/from CP]
[Image mapping labels: one pixel data per PE; column(s) of the source image data held in the IMEM of one PE]

Slide 10/30: Memory Access Pattern Categories
Group | Name | Input -> Output | Example
PO | Point Op. | image X -> image Y |
LNO | Local Neigh. Op. | image X -> image Y | 2d-filters, NN
RNO | Recursive Neigh. Op. | image X -> image Y | distance trans.
GlO | Global Op. | image X -> image Y | FFT
GeO | Geometric Op. | image X -> image Y | affine
SO | Statistical Op. | image X -> vector / scalar V | histogram
OO | Object Op. | image X -> vector / scalar V | labelling/propagation
[Figure labels: sensors, pre-processing, low-level feature extraction, higher level feature extraction, measurements, local feature based discrimination, high-level decision; low-level image processing (pixels), intermediate-level image processing (symbols); image processing, image recognition]
Sources:
E.R. Komen: Low-level Image Processing Architectures, Ph.D. Thesis, TUD, Netherlands, 1990.
P.P. Jonker: Architectures for Multidimensional Low- and Intermediate Level Image Processing, Proc. of IAPR Workshop on Machine Vision Applications (MVA'90), pp.307--316, 1990.

Slide 11/30: Memory Access Pattern Parallelization Issue
Conventional continuous (or strided) address data supply (ex. streaming data supply) is not sufficient for parallelizing most of the required memory access patterns.
Group | Supported (○) or not (×)
PO | ○
LNO | ○
SO | ×
GlO | ×
GeO | ×
RNO | ×
OO | ×
[Figure labels: SIMD + VLIW PEs on a unified RAM; access patterns: completely local, local neighborhood, global, recursive, data dependent]

Slide 12/30: Memory Access Pattern Parallelization Design: Line Methods
Pixel update | Locality: No | Locality: Yes
Unconstrained | SO, GlO, GeO | PO, LNO
Statically constrained (update location is statically predictable) | - | RNO
Dynamically constrained (update location must be dynamically determined) | - | OO
Line methods (PUL: Pixel Updating Line): row-wise, row-systolic, slant-systolic, autonomous.
The image requires a one RAM block / PE configuration.
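The requirement of one RAM block per PE can be illustrated with a histogram, the SO kernel named elsewhere in the slides. The sketch below is only an emulation in plain C under stated assumptions: `NUM_PE`, `LEVELS`, `local`, `total` and `histogram` are hypothetical names, the PE loop stands in for parallel hardware, and the scheme shown (private per-PE tables followed by a reduction) is not claimed to be the line method the authors use. It shows that each PE needs its own, data-dependent address in the same step, which a single continuous or strided address stream cannot supply.

```c
#include <assert.h>
#include <string.h>

#define NUM_PE 4      /* emulated PEs; the slides give 128 for IMAP-CE */
#define LEVELS 256    /* grey levels of an 8-bit pixel */

/* local[pe] stands for a table held in the private RAM block of PE pe
   (an assumption made for this sketch). img holds lines * NUM_PE pixels. */
void histogram(const unsigned char *img, int lines,
               unsigned local[NUM_PE][LEVELS], unsigned total[LEVELS]) {
    assert(lines >= 0);
    memset(local, 0, sizeof(unsigned) * NUM_PE * LEVELS);
    for (int y = 0; y < lines; y++)
        for (int pe = 0; pe < NUM_PE; pe++)      /* parallel on real hardware */
            local[pe][img[y * NUM_PE + pe]]++;   /* address differs per PE */
    /* Reduce the per-PE tables into one histogram. */
    for (int v = 0; v < LEVELS; v++) {
        total[v] = 0;
        for (int pe = 0; pe < NUM_PE; pe++)
            total[v] += local[pe][v];
    }
}
```

With a unified RAM, the increments in the inner loop would have to be serialized or the addresses gathered; with a private RAM block per PE they are independent.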

Slide 13/30: Line Methods (1): Combination of PULs
Examples: 90 degree rotation, thinning, connected component labeling.
[Figure labels: PE, propagation, 2 times]

Slide 14/30: Line Methods (2): Expected Speedup (when using N PEs)
N/2 to N times speedup by N PEs.
Figure notes: *1 under a unified RAM approach; *2 when using the memory array architecture.

Slide 16/30: 1DC: An Extended C Language
One (vector like) data structure and six operators.
    int d, e;
    sep char a, b;
    sep char c, ary[256];
[Table: Correspondence between parallelizing techniques and the 1DC syntax]
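One way to read the "vector like" data structure is that a `sep` variable holds one element per PE, and that an ordinary operator applied to `sep` operands runs in every PE at once. The plain C sketch below emulates that reading; it is an assumption about 1DC semantics, not the compiler's actual behavior, and `sep_char`, `sep_add` and `NUM_PE` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PE 128  /* PE count taken from the slides */

/* Hypothetical stand-in for a 1DC "sep char": one value per PE. */
typedef struct { signed char v[NUM_PE]; } sep_char;

/* Emulates "c = a + b" on sep variables: the same addition in every PE.
   On the real array the loop body would execute on all PEs in one step. */
sep_char sep_add(const sep_char *a, const sep_char *b) {
    assert(a != NULL && b != NULL);
    sep_char c;
    for (size_t pe = 0; pe < NUM_PE; pe++)
        c.v[pe] = (signed char)(a->v[pe] + b->v[pe]);
    return c;
}
```

Scalar variables such as `int d, e;` would, under the same reading, live once on the control processor rather than once per PE.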

Slide 17/30: 1DC: Line-wise Parallel Operation
Sequential languages (ex. C):
    for (y = 0; y < {number of lines}; y++)
        for (x = 0; x < {number of columns}; x++)
            .........
When using 1DC, skip the {number of columns} loop:
    for (y = 0; y < {number of lines}; y++)
        ...........
[Figure: Ex. an edge detection filter, shown at y = 0, y = 120, y = 200 and y = {number of lines}]
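The difference between the two loop nests can be sketched in plain C by isolating the per-line work in its own function. The slide does not say which edge filter is used, so the horizontal central difference below is an assumption, and `edge_line` and `edge_image` are hypothetical names. The loop over `x` inside `edge_line` is the part that would disappear in 1DC, because each column is handled by its own PE.

```c
#include <assert.h>
#include <stdlib.h>

/* One image line. In 1DC this whole loop would be a single line-wise
   statement executed by all PEs at once (one column per PE). */
void edge_line(const unsigned char *src, unsigned char *dst, int width) {
    assert(width >= 2);
    dst[0] = 0;               /* border columns have only one neighbour */
    dst[width - 1] = 0;
    for (int x = 1; x < width - 1; x++)
        dst[x] = (unsigned char)abs((int)src[x + 1] - (int)src[x - 1]);
}

/* Only the loop over lines remains explicit, as on the slide. */
void edge_image(const unsigned char *src, unsigned char *dst,
                int lines, int width) {
    for (int y = 0; y < lines; y++)
        edge_line(src + (size_t)y * width, dst + (size_t)y * width, width);
}
```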

Slide 18/30: Average Filter in 1DC (1)
    sep uchar src[256], dst[256];

    void ave33()
    {
        int i;
        sep int csum;
        for (i = 1; i < LINES - 1; i++) {
            csum = src[i-1] + src[i] + src[i+1];       /*1*/
            /* divide before the store so the 9-pixel sum is not truncated to uchar */
            dst[i] = (:>csum + csum + :<csum) / 9;     /*2*/
        }
    }
Statement /*1*/: summing three lines at the same time.
    src[i-1]:  ...  a6        a7        a8        ...
    src[i]:    ...  b6        b7        b8        ...
    src[i+1]:  ...  c6        c7        c8        ...
    csum:      ...  a6+b6+c6  a7+b7+c7  a8+b8+c8  ...

Slide 19/30: Average Filter in 1DC (2)
    void ave33()
    {
        int i;
        sep int csum;
        for (i = 1; i < LINES - 1; i++) {
            csum = src[i-1] + src[i] + src[i+1];       /*1*/
            /* divide before the store so the 9-pixel sum is not truncated to uchar */
            dst[i] = (:>csum + csum + :<csum) / 9;     /*2*/
        }
    }
Statement /*2*/: neighbor references (:>, :<) and "+".
    :>csum, csum and :<csum are the column sums of a PE and of its two neighbors.
    Column 6: (a5+b5+c5) + (a6+b6+c6) + (a7+b7+c7)
    Column 7: (a6+b6+c6) + (a7+b7+c7) + (a8+b8+c8)
    Column 8: (a7+b7+c7) + (a8+b8+c8) + (a9+b9+c9)
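The two statements of the average filter can be emulated in plain C by making the PE dimension an explicit array index. This is a sketch under assumptions, not the 1DC compiler's output: `WIDTH` is a hypothetical small stand-in for the 128 PEs, border columns are simply skipped because the slides do not show 1DC's border rule, and since the filter is symmetric the sketch does not need to decide which neighbor `:>` and `:<` each read.

```c
#include <assert.h>

#define WIDTH 8   /* columns = emulated PEs (128 on IMAP-CE; small here) */
#define LINES 4   /* image lines, as in the slide's LINES */

/* src[i][pe] plays the role of the sep array element src[i] in PE pe. */
void ave33(unsigned char src[LINES][WIDTH], unsigned char dst[LINES][WIDTH]) {
    int csum[WIDTH];                       /* emulates "sep int csum" */
    assert(LINES >= 3 && WIDTH >= 3);
    for (int i = 1; i < LINES - 1; i++) {
        /* Statement 1: every PE sums its own column over three lines. */
        for (int pe = 0; pe < WIDTH; pe++)
            csum[pe] = src[i - 1][pe] + src[i][pe] + src[i + 1][pe];
        /* Statement 2: every PE adds the csum of both neighbours.
           Border PEs are left untouched (an assumption of this sketch). */
        for (int pe = 1; pe < WIDTH - 1; pe++)
            dst[i][pe] = (unsigned char)
                ((csum[pe - 1] + csum[pe] + csum[pe + 1]) / 9);
    }
}
```

Computing the column sums once per line and reusing them across neighbors means each output pixel costs two neighbor references instead of eight pixel loads.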

Slide 20/30: Toward Efficient Execution of 1DC Codes
1DC program -> 1DC compiler / linker
Hardware features named on the slide: fast PE grouping, pipelined data exchange, fast left/right referencing, fast index addressing.
[Figure labels: row PE array, systolic PE array, slant PE array, autonomous PE array]
[Block diagram labels: Video IN, Video OUT, SR0 to SR3, Host Processor, Control Processor (CP; 4-way VLIW; P$, D$, STK RAM), PE 0 to 127, IMEM, External Mem. I/F, SDRAM/SSRAM]

Slide 21/30: Programming Environment
Tool chain: 1DC Optimizing Compiler, 1DC Symbolic Debugger, 1DC Source Code Library, IMAP Assembler, Linker, IMAP-CE PCI board.
Debugger windows and functions:
1DC source code window
Source image window
Image recognition result window
Timing measurement result for each source code line
Assign variables to sliders (real-time value tuning debugging)

Slide 23/30: Operation Group Kernels
Flexibility against various memory access patterns.
Op. Grp. | Kernel Name | IPC
PO | Color format trans. | 1.40
LNO | 3x3 ave. filter | 1.33
SO | Histogram | 1.66
GlO | FFT | 1.55
GeO | 90 degree rotation | 1.23
RNO | Distance transform | 1.52
OO | Connected component labeling | 1.40
[Figure: speedup and parallelism (max. 128) for the operation group kernels; IMAP-CE@100MHz with 1DC compiler codes, GPP@2.4GHz with Intel C compiler codes]

Slide 24/30: Highly Parallel vs. Sub-Word SIMD
Flexibility against algorithmic complexity.
Benchmark kernels (only PO and LNO kernels are used, due to the nature of MMX instructions):
name | Purpose
Add2 | dyadic arithmetic
GreyOpen3 | 3x3 grey morphology
Gauss5 | 5x5 filter
Mexican13 | 13x13 conv.
Var5Oct | 5x5 texture analysis
Canny | edge detection (3x3)
Smoothing | edge preserving smoothing (7x7)
Processors compared (GOPS: in byte operation):
Processor | Op. Freq. | PE # | Peak Perf.
P4 (SIMD) | 2.4GHz | 1PE x 8 x 2 | 38.4 GOPS
IMAP-CE | 100MHz | 128PE x 4 | 51.2 GOPS
IMAP-CE / GPP | x 1/24 | x 32 | x 1.33
[Figure: speed-up versus # of if-clauses per pixel op.; IMAP-CE@100MHz with 1DC compiler codes, GPP@2.4GHz with MMX codes]
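The peak-performance figures and the three ratios on this slide follow directly from the factors it lists. The worked arithmetic below uses only the slide's numbers; the symbols $P$ and $f$ are introduced here for illustration, and the reading of each factor as operations issued per cycle is an assumption.

```latex
\begin{align*}
P_{\text{IMAP-CE}} &= 128 \times 4 \times 0.1\,\text{GHz} = 51.2\ \text{GOPS},\\
P_{\text{P4}}      &= 1 \times 8 \times 2 \times 2.4\,\text{GHz} = 38.4\ \text{GOPS},\\
\frac{f_{\text{IMAP-CE}}}{f_{\text{P4}}} &= \frac{0.1}{2.4} = \frac{1}{24},\qquad
\frac{128 \times 4}{1 \times 8 \times 2} = 32,\qquad
\frac{P_{\text{IMAP-CE}}}{P_{\text{P4}}} = \frac{51.2}{38.4} \approx 1.33 .
\end{align*}
```

The 32-fold advantage in parallel operations per cycle is almost entirely offset by the 24-fold lower clock, which is why the peak ratio is only about 1.33.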

Slide 25/30: Compared with Some Recent Media Processors
On-chip memory organization:
Scratch pad memories (one to several banks): SRF of Imagine (Stanford), Frame Buffer of Morphosys (UC), Local Store of SPE (CELL: Sony)
Vector partitioning & chaining: VIRAM (UCB), CODE (Stanford)
Static vector partitioning: IMAP (128 bank memory, 2KB per PE)
1024 point 1D-FFT performance compared with other media processors:
Processor Name | Cycle count | Word Size | Die-size | Pwr (W) | Tech (um)
Imagine (Float) | 2176 | 16 | 12*12 | 4 | 0.15
Morphosys | 226361616*16 (cycle count, word size and die-size run together) | | | 4 | 0.13
IMAP-CE (IMAPCAR) | 5000 (3700) | 8 | 11*11 | 4 (2) | 0.18 (0.13)
VIRAM | 5280 | 16 | 15*18 | 2 | 0.18

Slide 26/30: A Real Application: Vehicle Detection
Flexibility at the application level.
Input: forward looking camera.
Processing steps: lane mark detection, search, validate, tracking vehicles.
Figure notes: four local windows; max. six vehicles.
IMAP-CE@100MHz: use 1DC. GPP@2.4GHz: use C.

Slide 27/30: The Uneven Workload Issue
[Figure: processing time distribution]
Search: PE array fully utilized.
Validation: partial activation of the PE array during sequential validation of each candidate area.

Slide 29/30: Summary
Requirements of an embedded image recognition processor:
1) High Performance
2) Low Cost / High Reliability
3) High Flexibility
The IMAP approach:
Parallel and systolic algorithm design methodology
+ Hardware support of parallelizing methods
+ Extended C Compiler & GUI Debugger
[Figure: flexibility versus performance with classes (a) to (e); labels: GPPs, Media Extended DSPs, Assembly programmed DSPs, Highly parallel SIMD, Wired logics (+DSP core), Technology Barrier, Flexibility Gap]

Slide 30/30: The END (Thank you for your attention)
