An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems (ISCA 2005, NEC Corporation)

Presentation transcript:

6th June, ISCA 2005, 1/30, NEC Corporation
An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems
Shorin KYO (*1), Shin'ichiro OKAZAKI (*1), Tamio ARAI (*2)
*1 Media and Information Research Laboratories, NEC Corporation
*2 School of Engineering, University of Tokyo

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluations
5. Summary

Three Basic Requirements
Ex. Embedded Driver Assistant Systems: lane marks, preceding obstacles, side/back obstacles, traffic signs, pedestrians
1) High Performance: realtime response (GOPS)
2) Cost/Power Efficiency: low cost, easy cooling (< 2 Watt), high quality / reliability, low EMI
3) High Flexibility (Scalability and Versatility): able to handle the combination of [applications × situations × targets]; robustness

Applications × Situations × Targets
Dynamic Back Up Aid, Cross Traffic Warning, Following Distance Warning, Park Slot Measurement, Backup Parking Assist, Stop&Go, Side Pre-Crash, Cut-In, Front Pre-Crash, Lane Change Assist, Pedestrian Protection, Blind Spot Detection, Drowsiness Warning, Traffic Sign Recognition

COR: Control versus Operational circuit Ratio
Trading-off items: 1) Performance (higher) 2) Cost (lower) 3) Flexibility (higher)
[Figure: processor classes placed by % of control circuitry (flexibility) versus % of operational circuitry ((peak) performance), with cost given as die size / power consumption. Classes: a) Desktop/Server CPU (GPPs), b) MIMDs (Multi-Cores), c) DSPs, d) Highly parallel SIMDs, e) Special purpose LSI. Labeled examples: Itanium, Sparc64, SPE (CELL), FR1000, FR500, IMAP-CE, IMAPCAR, CODEC LSI.]

Overcoming the Flexibility Gap
Challenge of embedded image processors ⇒ minimizing COR while overcoming the "Flexibility Gap"
[Figure: flexibility versus performance under a fixed cost and technology constraint (a technology barrier), showing the share of control circuits and operational circuits for (a) GPPs, (b) DSPs and MIMDs, (c) Highly parallel SIMDs, (d) Custom logics + DSP core, (e) Custom logics only, with the flexibility gap marked.]

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary

IMAP Series Processors
[Figure: peak performance (GOPS) versus year; publication labels on the plot: ISSCC'95, CAMP'97, ISSCC'03]
IMAP-1 and IMAP-VISION (spec labels as extracted: "MHz, 32PE/Chip"; "15MHz, 8PE/Chip")
IMAP-2: 40MHz, 64PE/Chip
IMAP-CE: 100MHz, 128PE/Chip, 4-Way VLIW, 50GOPS, 0.18um, 2 to 4 Watt
IMAPCAR: 100MHz, 128PE/Chip, 4-Way VLIW+MAC, 100GOPS, (-40℃ to 85℃), 0.13um, <2 Watt
[Die photo: IMAP-CE (32.7M Tr, 0.18um), 11.0mm; blocks: PE8 (eight PEs integration block), CP, EXTIF, DPLL]

Block Diagram and Features
1) 100MHz, 128 4-Way VLIW linear array PEs
2) Two level memory architecture + user DMA
3) Automated mapping of image data to each PE
4) 128 individual RAM blocks configuration
5) 1DC (One Dimensional C) + "Line methods"
6) Enhanced PE instruction set design for 1DC
[Block diagram: Host Processor; Control Processor (CP) with P$, D$, STK RAM; CP instruction broadcast (SIMD) to 128 PEs; Video IN / Video OUT via SR0, SR1, SR2, SR3; IMEM (2KB); External Mem. I/F to EMEM (SDRAM/SSRAM); bandwidth labels 12.8 GByte/s and 0.8 GByte/s]
[PE detail: 4-Way VLIW; 24 x 8b General Purpose Registers; ALUx1, MULx1, LOGx1, LSUx1; unit labels ADD, MUL, RDU, LSU, COMM, LOG; paths To/Fr other PEs, To/Fr IMEM, To/Fr CP]
[Data mapping: one pixel data per PE; the IMEM of one PE holds column(s) of the source image data]
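The column-to-PE mapping named in feature 3 can be illustrated with a small sketch. The slide only says that the IMEM of one PE holds column(s) of the image; the interleaved assignment below (column x goes to PE x mod 128) and the helper names `pe_for_column`, `columns_of_pe` and `distribute` are assumptions made for illustration, not the documented IMAP scheme.

```python
NUM_PES = 128  # PE count stated on the slide


def pe_for_column(x, num_pes=NUM_PES):
    """Assumed interleaved mapping: image column x is owned by PE (x mod num_pes)."""
    return x % num_pes


def columns_of_pe(pe, width, num_pes=NUM_PES):
    """All image columns that land in the local memory of one PE."""
    return list(range(pe, width, num_pes))


def distribute(image, num_pes=NUM_PES):
    """Split a row-major image (list of rows) into per-PE column stores.

    imem[pe] maps a column index to the list of pixels in that column,
    standing in for the per-PE RAM block (IMEM).
    """
    width = len(image[0])
    imem = [dict() for _ in range(num_pes)]
    for pe in range(num_pes):
        for x in columns_of_pe(pe, width, num_pes):
            imem[pe][x] = [row[x] for row in image]
    return imem
```

With a 256-pixel-wide image, each of the 128 PEs ends up holding two columns under this assumed mapping.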

Memory Access Pattern Categories
Point Op. (PO): Input Image X → Output Image Y
Local Neigh. Op. (LNO): Input Image X → Output Image Y; ex. 2d-filters, NN
Recursive Neigh. Op. (RNO): Input Image X → Output Image Y; ex. distance trans.
Global Op. (GlO): Input Image X → Output Image Y; ex. FFT
Geometric Op. (GeO): Input Image X → Output Image Y; ex. affine
Statistical Op. (SO): Input Image X → Output vector / scalar V; ex. histogram
Object Op. (OO): Input Image X → Output vector / scalar V; ex. labelling/propagation
[Figure labels: Sensors; Pre-processing; Low-level Feature Extraction; Higher level Feature Extraction; Local Feature based Discrimination; Measurements; High-level Decision; Low-level Image Processing; Intermediate-level Image Processing; Image processing; Image recognition; pixels; symbols]
References: E.R. Komen: Low-level Image Processing Architectures, Ph.D. Thesis, TUD, Netherlands. P.P. Jonker: Architectures for Multidimensional Low- and Intermediate Level Image Processing, Proc. of IAPR Workshop on Machine Vision Applications (MVA'90).

Memory Access Pattern Parallelization Issue
Conventional continuous (or strided) address data supply (ex. streaming data supply) is not sufficient for parallelizing most of the required memory access patterns.
PO ○, LNO ○, SO ×, GlO ×, GeO ×, RNO ×, OO ×
[Figure labels: Completely local; Local Neighborhood; Global; Recursive; Data dependent; Unified RAM; SIMD + VLIW PEs]

Memory Access Pattern Parallelization Design: Line Methods (PUL: Pixel Updating Line)
[Table residue: operation groups classified by locality (No / Yes) and by pixel update type: unconstrained pixel update; statically constrained (update location is statically predictable); dynamically constrained (update location must be dynamically determined). Cell entries as extracted: SO, GlO, GeO; －; PO, LNO; RNO; OO; －]
Line methods named on the slide: row-wise (PUL), row-systolic, slant-systolic, autonomous
The PE image requires a one RAM block / PE configuration.

Line Methods (1): Combination of PULs
Examples: 90 degree rotation, thinning, connected component labeling
[Figure labels: PE + Propagation; PE ++; 2 times]

Line Methods (2): Expected Speedup (when using N PEs)
N/2 to N times speedup by N PEs (*1, *2)
*1: When under a unified RAM approach
*2: When using the memory array architecture

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary

1DC: An Extended C Language
One (vector like) data structure and six operators
  int d, e;
  sep char a,b;
  sep char c,ary[256];
Correspondence between parallelizing techniques and the 1DC syntax.

1DC: Line-wise Parallel Operation
Ex. An Edge Detection Filter
Sequential Languages (Ex. C):
  for (y=0; y < {number of lines}; y++)
    for (x=0; x < {number of columns}; x++)
When using 1DC, skip the {number of columns} loop:
  for (y=0; y < {number of lines}; y++)
[Figure: processed line position at y=0, y=120, y=200, y={number of lines}]
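The difference between the two loop structures can be emulated in plain Python. This is only a sketch: the slide does not say which edge filter is used, so a vertical difference |src[y+1] - src[y-1]| is assumed, and `line_op` is a hypothetical helper that stands in for one SIMD step executed by all PEs at once.

```python
def edge_sequential(src):
    """C-style version: explicit loops over lines and columns."""
    lines, cols = len(src), len(src[0])
    dst = [[0] * cols for _ in range(lines)]
    for y in range(1, lines - 1):
        for x in range(cols):
            dst[y][x] = abs(src[y + 1][x] - src[y - 1][x])
    return dst


def line_op(f, *lines):
    """Apply f element-wise to whole lines (emulates one step on every PE)."""
    return [f(*vals) for vals in zip(*lines)]


def edge_linewise(src):
    """1DC-style version: only the loop over lines remains in the program."""
    lines, cols = len(src), len(src[0])
    dst = [[0] * cols for _ in range(lines)]
    for y in range(1, lines - 1):
        dst[y] = line_op(lambda below, above: abs(below - above),
                         src[y + 1], src[y - 1])
    return dst
```

Both functions produce the same image; in the line-wise form the column loop is hidden inside the per-line operation, which is the part a PE array would execute in parallel.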

Average Filter in 1DC (1): summing three lines at the same time
  sep uchar src[256], dst[256];
  ave33( ){
    int i;
    sep int csum;
    for(i=1;i<LINES-1;i++){
      csum = src[i-1] + src[i] + src[i+1]; /*1*/
      dst[i] = :>csum + csum + :<csum; /*2*/
      dst[i] /= 9;
    }
  }
[Figure for step /*1*/: on every PE, src[i-1] + src[i] + src[i+1] = csum; for three adjacent columns this gives a6+b6+c6, a7+b7+c7, a8+b8+c8, with a, b, c being the pixels of the three lines]

Average Filter in 1DC (2): neighbor reference (:>, :<) and "+"
  ave33( ){
    int i;
    sep int csum;
    for(i=1;i<LINES-1;i++){
      csum = src[i-1] + src[i] + src[i+1]; /*1*/
      dst[i] = :>csum + csum + :<csum; /*2*/
      dst[i] /= 9;
    }
  }
[Figure for step /*2*/: on every PE, :>csum + csum + :<csum = dst[i]; for example, the PE holding a7+b7+c7 adds the neighboring values a6+b6+c6 and a8+b8+c8, and likewise for the PEs holding a6+b6+c6 and a8+b8+c8]
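The two steps of ave33 can be traced with a plain Python emulation in which each list element plays the role of one PE's copy of a `sep` variable. This is a sketch under stated assumptions: which of `:>` and `:<` refers to the left or the right neighbor is not stated on the slides (the sum is symmetric, so the result does not depend on it), the border PEs are assumed to receive 0 from their missing neighbor, and the nine-pixel sum is kept in full precision before the division. The helper names `from_left` and `from_right` are hypothetical.

```python
def from_left(line, fill=0):
    """Value each PE reads from its left neighbor (assumed fill at the border)."""
    return [fill] + line[:-1]


def from_right(line, fill=0):
    """Value each PE reads from its right neighbor (assumed fill at the border)."""
    return line[1:] + [fill]


def ave33(src):
    """3x3 average filter, one image line per loop iteration."""
    lines, cols = len(src), len(src[0])
    dst = [[0] * cols for _ in range(lines)]
    for i in range(1, lines - 1):
        # Step /*1*/: every PE sums its own column over three lines.
        csum = [a + b + c for a, b, c in zip(src[i - 1], src[i], src[i + 1])]
        # Step /*2*/: every PE adds the column sums of both neighbors.
        total = [l + c + r
                 for l, c, r in zip(from_left(csum), csum, from_right(csum))]
        dst[i] = [t // 9 for t in total]
    return dst
```

For a uniform image the interior pixels keep their value, while the first and last columns come out lower because of the assumed zero fill at the array ends.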

Toward Efficient Execution of 1DC Codes
1DC program → 1DC compiler / linker
PE array labels: Row, Systolic, Slant, Autonomous
Hardware support labels: fast PE grouping, pipelined data exchange, fast left/right referencing, fast index addressing
[Block diagram: Host Processor; Control Processor (CP) with P$, D$, STK RAM; 128 4-Way VLIW PEs; Video IN / Video OUT via SR0, SR1, SR2, SR3; IMEM; External Mem. I/F to SDRAM/SSRAM]

Programming Environment
Tool chain: 1DC Optimizing Compiler, 1DC Symbolic Debugger, 1DC Source Code Library, IMAP Assembler, Linker, IMAP-CE PCI board
Debugger GUI: 1DC source code window; source image window; image recognition result window; timing measurement result for each source code line; real-time value tuning and debugging (assign variables to sliders)

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary

Operation Group Kernels: flexibility against various memory access patterns
Op. Grp.   Kernel Name                     IPC
PO         Color format trans
LNO        3x3 ave. filter                 1.33
SO         Histogram                       1.66
GlO        FFT                             1.55
GeO        90 degree rotation              1.23
RNO        Distance transform              1.52
OO         Connected component labeling    1.40
[Chart: speedup and parallelism (max. 128) per operation group kernel, 1DC compiler codes versus Intel C compiler codes]

Highly Parallel vs. Sub-Word SIMD: flexibility against algorithmic complexity
Processor      Op. Freq.   PE #           Peak Perf.
P4 (SIMD)      2.4GHz      1PE x 8 x 2    38.4 GOPS
IMAP-CE        100MHz      128PE x 4      51.2 GOPS
IMAP-CE / GPP  x 1/24      x 32           x 1.33
(GOPS: in byte operation)
Benchmark kernels (only PO and LNO kernels are used due to the nature of MMX instructions):
Name        Purpose                               Group
Add2        dyadic arithmetic                     PO
GreyOpen3   3x3 grey morphology                   LNO
Gauss5      5x5 filter                            LNO
Mexican13   13x13 conv.                           LNO
Var5Oct     5x5 texture analysis                  LNO
Canny       edge detection (3x3)                  LNO
Smoothing   edge preserving smoothing (7x7)       LNO
[Chart: speed-up of 1DC compiler codes over MMX codes versus the number of if-clauses per pixel operation]
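The peak figures and ratios in the table follow from multiplying clock frequency, PE count and operations per PE per cycle. The short check below reproduces them; the function name `peak_gops` is introduced here for illustration, and the factors are taken exactly as written on the slide ("1PE x 8 x 2" and "128PE x 4") without interpreting what each factor stands for.

```python
def peak_gops(freq_mhz, pes, ops_per_pe_per_cycle):
    """Peak giga-operations per second from clock, PE count and ops per cycle."""
    return freq_mhz * pes * ops_per_pe_per_cycle / 1000


p4 = peak_gops(2400, 1, 8 * 2)     # 38.4, as on the slide
imap_ce = peak_gops(100, 128, 4)   # 51.2, as on the slide

freq_ratio = 100 / 2400            # 1/24
pe_ratio = (128 * 4) / (1 * 8 * 2) # 32
perf_ratio = imap_ce / p4          # about 1.33
```

The three ratios match the "IMAP-CE / GPP" row: 1/24 of the clock frequency times 32 times the parallel operations gives about 1.33 times the peak performance.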

Compared with Some Recent Media Processors
PE memories (scratch pad memories):
  SRF of Imagine (Stanford), Frame Buffer of Morphosys (UC), Local Store of SPE (CELL: Sony): one to several banks
  IMAP: 128 bank memory (2KB), holding the PE image
Vector handling:
  VIRAM (UCB), CODE (Stanford): on chip vector partitioning & chaining
  IMAP: static vector partitioning
1024 point 1D-FFT performance compared with other media processors:
Processor Name      Cycle count   Word Size   Die-size   Pwr(W)   Tech(um)
Imagine (Float)     *
Morphosys           *
IMAP-CE (IMAPCAR)   5000 (3700)   8           11*11      4 (2)    0.18 (0.13)
VIRAM               *

A Real Application: Vehicle Detection
Flexibility at the application level
Input: forward looking camera
Stages: Lane Mark Detection (four local windows); Search; Validate; Tracking vehicles (max. six vehicles)
[Figure labels: use 1DC; use C]

The Uneven Workload Issue
Processing time distribution
Search: PE array fully utilized
Validation: partial activation of the PE array during sequential validation of each candidate area

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary

Summary
Embedded Image Recognition Processor: 1) High Performance 2) Low Cost / High Reliability 3) High Flexibility
The IMAP approach: parallel and systolic algorithm design methodology + hardware support of parallelizing methods + extended C compiler & GUI debugger
[Figure: flexibility versus performance under the technology barrier, positions (a) to (e); labels: GPPs, Media Extended DSPs, Assembly programmed DSPs, Highly parallel SIMD, Wired logics (+DSP core), Flexibility Gap]

The END (Thank you for your attention)