The CA1024: A Massively Parallel Processor for Cost-Effective HDTV
Connex Technology Proprietary and Confidential

Presentation transcript:

Slide 1: The CA1024: A Massively Parallel Processor for Cost-Effective HDTV

Slide 2: Company Background
Fabless semiconductor company in Silicon Valley
VC funded (series A & B)
In the product-development stage with 26+ employees
  – Deep experience with video algorithms, processor design, and digital-video system software
Core asset: ConnexArray™ vector-processor architecture
  – Architecture verified in CA4096 test chip
Six patent applications on Connex vector-processor technology
  – 1 US patent granted, 3 US patents pending, 2 US provisional
  – Granted and pending patents also filed in China, Taiwan, Korea, EEC, Japan, Singapore
Initial market focus on DTV

Slide 3: Presentation Agenda
Why a massively parallel processor (MPP)?
How is MPP integrated in an SoC?
Processor performance
Project status

Slide 4: Challenges
HDTV codec & post-processing are computationally intensive
Computation is dominated by data-parallel processes
HDTV is a fast-evolving domain
ASICs are a very costly solution

Slide 5: Our Solution: Integral Parallel Machine
Data-parallel computation
Time-parallel computation (supported by speculative parallelism)
I/O process is transparent to the computational process

Slide 6: Key Technology
Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
  – Simple programming model
Silicon-efficient architecture; die size competitive with similar-function ASICs
  – Re-use of transistors
  – Minimal dedicated hard-wired blocks
Sufficient performance to enable multistandard, multichannel, high-definition DTV
  – Linearly scalable

Slide 7: The Connex Architecture
CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
Test Chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
3.2 GByte/sec I/O channel in parallel with code running on the Connex Array
[Figure labels: I/O Controller, Sequencer, Connex Array (indices 0, 1, ... n and 0, 1, ... m); per cell: 16-bit ALU, R0-R7, Connex, I/O, AUX, Select, Index, Address, 16-bit RAM (255)]

Slide 8: Connex Cell Architecture
PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
Select flag enables or disables instruction processing
Index is a unique cell number used to direct certain instructions
Bidirectional 16-bit bus to 256 RAM locations
Connex register includes connections for shifts to/from adjacent PE
Aux and I/O registers dedicated to specific instruction functions
[Figure labels: 16-bit ALU, R0-R7, Connex, I/O, AUX, Select, Index, Address 0, RAM]
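The cell described on this slide can be pictured as a small data structure. The following is only a sketch in plain C, not the vendor's definition: the type name `ConnexCell`, the field names, and the helpers `cell_init`, `cell_load`, and `cell_store` are all hypothetical, and the slide does not say which of R0-R7 serve as the Connex, Aux, and I/O registers, so no such mapping is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PE_REGS      8    /* eight accumulator registers, R0..R7 (from the slide) */
#define PE_RAM_WORDS 256  /* 256 RAM locations per cell (from the slide) */

/* Hypothetical model of one Connex cell; all names are illustrative. */
typedef struct {
    uint16_t r[PE_REGS];        /* 16-bit registers R0..R7 */
    uint16_t ram[PE_RAM_WORDS]; /* local 16-bit RAM */
    uint16_t index;             /* unique cell number */
    bool     select;            /* true: the cell processes instructions */
} ConnexCell;

/* Clear a cell, give it its index, and mark it selected. */
static void cell_init(ConnexCell *c, uint16_t index) {
    assert(c != NULL);
    memset(c, 0, sizeof *c);
    c->index = index;
    c->select = true;
}

/* RAM -> register. Skipped when the cell is not selected.
 * Returns true if the transfer was performed. */
static bool cell_load(ConnexCell *c, unsigned reg, unsigned addr) {
    assert(c != NULL && reg < PE_REGS && addr < PE_RAM_WORDS);
    if (!c->select) return false;
    c->r[reg] = c->ram[addr];
    return true;
}

/* Register -> RAM, with the same select gating as cell_load. */
static bool cell_store(ConnexCell *c, unsigned reg, unsigned addr) {
    assert(c != NULL && reg < PE_REGS && addr < PE_RAM_WORDS);
    if (!c->select) return false;
    c->ram[addr] = c->r[reg];
    return true;
}
```

Gating every transfer on `select` mirrors the slide's statement that the select flag enables or disables instruction processing; whether the real hardware gates RAM transfers in exactly this way is an assumption.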

Slide 9: ConnexArray Structure
Replicated Connex cells each include PE and local RAM
Linear interconnect of neighbor registers
Conditional execution based on state of select bit or index value
All selected cells execute the same instruction stream
[Figure labels: three example cells up to index 1023, each with a 16-bit ALU, R0-R7, and RAM (255); select bits shown as 1 (On) or 0 (Off)]
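The linear interconnect can be illustrated as a one-position shift of a register across neighbouring cells. This is a sketch under stated assumptions: the function names `shift_up` and `shift_down` are hypothetical, the array holds one register value per cell, and the value fed into the end cell (`fill`) is an assumption because the slide does not describe the boundary behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Move each cell's value to the neighbour with the next higher index.
 * Cell 0 receives `fill` (boundary behaviour is an assumption). */
static void shift_up(uint16_t *connex, size_t n, uint16_t fill) {
    assert(connex != NULL);
    if (n == 0) return;
    for (size_t i = n - 1; i > 0; --i) {
        connex[i] = connex[i - 1];
    }
    connex[0] = fill;
}

/* Move each cell's value to the neighbour with the next lower index.
 * The last cell receives `fill` (boundary behaviour is an assumption). */
static void shift_down(uint16_t *connex, size_t n, uint16_t fill) {
    assert(connex != NULL);
    if (n == 0) return;
    for (size_t i = 0; i + 1 < n; ++i) {
        connex[i] = connex[i + 1];
    }
    connex[n - 1] = fill;
}
```

On the hardware every cell would hand its value to its neighbour in the same step; the loops here only emulate that sequentially.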

Slide 10: Connex Data-Array Structure
16-bit data operands
256 lines with bit elements per line
1 GByte data I/O in parallel with computation operations
[Figure labels: Element n, Line m]

Slide 11: Full Line Operations: Operate On All Elements in Parallel
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
OP: +, -, *, XOR, etc.
[Figure: Line i OP Line j = Line k]
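The two forms of full-line operation can be emulated in plain C as element-wise loops. This is a sequential sketch, not Connex code: the names `LineOp`, `apply_op`, `line_op_line`, and `line_op_scalar` are hypothetical, and results are assumed to wrap to 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operators named on the slide; the enum itself is illustrative. */
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_XOR } LineOp;

/* One 16-bit operation; results wrap modulo 2^16 (an assumption). */
static uint16_t apply_op(LineOp op, uint16_t a, uint16_t b) {
    switch (op) {
    case OP_ADD: return (uint16_t)(a + b);
    case OP_SUB: return (uint16_t)(a - b);
    case OP_MUL: return (uint16_t)((uint32_t)a * b); /* widen to avoid signed overflow */
    case OP_XOR: return (uint16_t)(a ^ b);
    }
    assert(0 && "unknown operator");
    return 0;
}

/* Line k = Line i OP Line j, element by element. */
static void line_op_line(uint16_t *k, const uint16_t *i, const uint16_t *j,
                         size_t n, LineOp op) {
    for (size_t e = 0; e < n; ++e) {
        k[e] = apply_op(op, i[e], j[e]);
    }
}

/* Line k = Line i OP scalar, with the scalar repeated for all elements. */
static void line_op_scalar(uint16_t *k, const uint16_t *i, uint16_t scalar,
                           size_t n, LineOp op) {
    for (size_t e = 0; e < n; ++e) {
        k[e] = apply_op(op, i[e], scalar);
    }
}
```

Each loop iteration here stands for what one cell would do; on the array, all elements of a line are processed in the same step.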

Slide 12: Columns Active Based On Repeating Patterns
Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
[Figure: Line i OP Line j = Line k, with OP one of +, -, *, XOR, etc.]
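Pattern-based activation can be sketched as a Boolean mask built from a period and a phase, followed by an operation applied only where the mask is set. The helper names `select_pattern`, `select_or`, and `add_selected` are hypothetical, and columns are numbered from 0 here, which is an assumption about the slide's numbering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mark column c active when c % period == phase.
 * Odd columns: period 2, phase 1. Every third column: period 3, phase 2. */
static void select_pattern(bool *sel, size_t n, size_t period, size_t phase) {
    assert(sel != NULL && period > 0 && phase < period);
    for (size_t c = 0; c < n; ++c) {
        sel[c] = (c % period) == phase;
    }
}

/* Combine two patterns: a column is active if either pattern marks it. */
static void select_or(bool *dst, const bool *a, const bool *b, size_t n) {
    for (size_t c = 0; c < n; ++c) {
        dst[c] = a[c] || b[c];
    }
}

/* Add `value` only in active columns; inactive columns keep their data. */
static void add_selected(uint16_t *line, const bool *sel, size_t n,
                         uint16_t value) {
    for (size_t c = 0; c < n; ++c) {
        if (sel[c]) {
            line[c] = (uint16_t)(line[c] + value);
        }
    }
}
```

Combining masks with `select_or` is one way to read "every third and fourth column"; the slide does not say how such compound patterns are specified.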

Slide 13: Columns Active Based On Results of Previous Operations
Example: Apparently random columns are marked active, based on data-dependent results of previous operations. This enables selective processing based on data content.
[Figure: Line i OP Line j = Line k, with OP one of +, -, *, XOR, etc.]
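Data-dependent activation can be sketched as a comparison that produces the mask, followed by an operation restricted to the marked columns. The example below clamps values above a threshold; the clamping use case and the names `select_greater` and `assign_selected` are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mark each column whose value exceeds `threshold`.
 * Returns the number of columns marked active. */
static size_t select_greater(const int16_t *line, size_t n, int16_t threshold,
                             bool *sel) {
    assert(line != NULL && sel != NULL);
    size_t active = 0;
    for (size_t c = 0; c < n; ++c) {
        sel[c] = line[c] > threshold;
        if (sel[c]) {
            ++active;
        }
    }
    return active;
}

/* Write `value` into active columns only; other columns are untouched. */
static void assign_selected(int16_t *line, const bool *sel, size_t n,
                            int16_t value) {
    assert(line != NULL && sel != NULL);
    for (size_t c = 0; c < n; ++c) {
        if (sel[c]) {
            line[c] = value;
        }
    }
}
```

The mask depends entirely on the data, so the active columns need not follow any regular pattern, which matches the slide's "apparently random" description.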

Slide 14: Outer-Loop Parallelism: Program in context of 128+ data-structure instances
Example: 8x8 DCT
Example: 128 sets of 8x8 run in parallel in a 1024-cell array
[Figure labels: Line i ... Line j; 8x8 blocks with indices up to 7]
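The figure of 128 parallel instances follows from 1024 cells divided into groups of 8 columns. The sketch below assumes one possible layout, which the slide does not spell out: each 8x8 block occupies 8 adjacent cells, with its 8 rows stored in 8 lines. The names `block_of`, `column_of`, and `column_sums` are hypothetical, and a column sum stands in for a real kernel such as the DCT.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CELLS = 1024,               /* cells in the array (from the slide) */
    BLOCK_DIM = 8,              /* 8x8 blocks (from the slide) */
    BLOCKS = CELLS / BLOCK_DIM  /* 128 blocks side by side */
};

/* Which 8x8 block a cell belongs to under the assumed layout. */
static int block_of(int cell) {
    assert(cell >= 0 && cell < CELLS);
    return cell / BLOCK_DIM;
}

/* Which column of its block a cell holds. */
static int column_of(int cell) {
    assert(cell >= 0 && cell < CELLS);
    return cell % BLOCK_DIM;
}

/* Sum each block column over its 8 rows. The loop over rows is the
 * sequential program; the loop over cells emulates what all cells,
 * and therefore all 128 blocks, would do at once. */
static void column_sums(int16_t lines[BLOCK_DIM][CELLS], int32_t sums[CELLS]) {
    for (int c = 0; c < CELLS; ++c) {
        sums[c] = 0;
    }
    for (int r = 0; r < BLOCK_DIM; ++r) {
        for (int c = 0; c < CELLS; ++c) {
            sums[c] += lines[r][c];
        }
    }
}
```

The point of the sketch is that the program is written once, for a single 8x8 instance, and the array width supplies the outer loop over instances.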

Slide 15: I/O System
[Figure labels: I/O Plane, Connex Array, IOC, IS, Interrupts, Switch Fabric, DDR-DRAM Controller, DRAM]

Slide 16: Computational-Intensive Architecture
All forms of parallelism are strongly segregated
  – Connex Array for data-parallel computation
  – Speculative Array for time-parallel computation
The granularity perfectly fits the application domain
  – 16-bit processing elements
  – no MACs, no FPUs, no multipliers…

Slide 17: High I/O Bandwidth
External I/O: 3.2 GBytes/sec
  – Serial access and random access with similar performance
Internal I/O: 400 GBytes/sec

Slide 18: Area & Power Efficiency
2 GOPS/mm² (peak performance)
GOPS/Watt is 25–50 times greater than a mature sequential technology

Slide 19: Programming Connex
CPL (Connex Programming Language) is an extension of C with C/C++ syntax
Code that operates on scalar data is written in regular C notation
Connex-specific operators defined for features not available in C, e.g. operations on vectors, selections
CPL uses sequential operators and control structures on vector and select datatypes
Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
Hides the complexities of the parallel execution hardware
Complete SDK

    {
        ...
        const short OFFSET = 15;
        ...
        short vector x, y;
        short vector min, max;
        ...
        sel = all;
        x += OFFSET;
        ...
        min = (x < y)? x : y;
        max = (x > y)? x : y;
        ...
    }

Vectors are arrays of scalar components.
Selections are arrays of Boolean values that dictate which vector components are active.
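For readers without the CPL toolchain, the slide's fragment can be approximated in standard C. This is an emulation under stated assumptions rather than a description of how CPL is compiled: the types `ShortVector` and `Selection`, the helper names, the vector length of 1024, and the rule that inactive components keep their old values are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { N = 1024 };  /* assumed vector length: one component per cell */

typedef struct { int16_t v[N]; } ShortVector;   /* stands in for "short vector" */
typedef struct { bool active[N]; } Selection;   /* stands in for "sel" */

/* sel = all; */
static void select_all(Selection *sel) {
    assert(sel != NULL);
    for (int i = 0; i < N; ++i) sel->active[i] = true;
}

/* x += s; applied to active components only. */
static void vec_add_scalar(ShortVector *x, const Selection *sel, int16_t s) {
    for (int i = 0; i < N; ++i) {
        if (sel->active[i]) x->v[i] = (int16_t)(x->v[i] + s);
    }
}

/* out = (x < y) ? x : y; on active components. */
static void vec_min(ShortVector *out, const ShortVector *x,
                    const ShortVector *y, const Selection *sel) {
    for (int i = 0; i < N; ++i) {
        if (sel->active[i]) out->v[i] = (x->v[i] < y->v[i]) ? x->v[i] : y->v[i];
    }
}

/* out = (x > y) ? x : y; on active components. */
static void vec_max(ShortVector *out, const ShortVector *x,
                    const ShortVector *y, const Selection *sel) {
    for (int i = 0; i < N; ++i) {
        if (sel->active[i]) out->v[i] = (x->v[i] > y->v[i]) ? x->v[i] : y->v[i];
    }
}
```

Calling `select_all`, then `vec_add_scalar` with 15, then `vec_min` and `vec_max` reproduces the order of statements in the CPL fragment; the explicit loops are what CPL's vector operators are said to hide from the programmer.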

Slide 20: Performance
DCT: 0.35 clock cycle per pixel
SAD: clock cycle per pixel

Slide 21: H.264 Dual HD Stream Decoding
Stage                   Clock cycles per macroblock
Dezigzagging            37.3
Intra Prediction        54.1
IT/IQ                   97.3
Motion Compensation
Deblocking Filter       27.1
Total
Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
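A per-macroblock cycle budget of this kind is the clock rate divided by the macroblock rate. The slide does not give the inputs, so the sketch below uses assumed values: a 200 MHz clock, a coded frame of 1920x1088 pixels, 30 frames per second, and two channels. Only the 16x16 macroblock size is standard H.264; the function names are hypothetical. With these assumptions the result is roughly 408.5 cycles, close to the 409 quoted, but this is not confirmed as the presenters' calculation.

```c
#include <assert.h>
#include <stdint.h>

/* H.264 macroblocks cover 16x16 luma pixels. */
static uint32_t macroblocks_per_frame(uint32_t width, uint32_t height) {
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

/* Cycles available per macroblock for a given clock and stream load.
 * All four inputs are assumptions when applied to the slide. */
static double cycles_per_macroblock(double clock_hz, uint32_t mb_per_frame,
                                    double frames_per_sec, uint32_t channels) {
    assert(clock_hz > 0.0 && mb_per_frame > 0 && frames_per_sec > 0.0 &&
           channels > 0);
    double mb_per_sec = (double)mb_per_frame * frames_per_sec * (double)channels;
    return clock_hz / mb_per_sec;
}
```

For example, `cycles_per_macroblock(200e6, macroblocks_per_frame(1920, 1088), 30.0, 2)` divides 200,000,000 cycles by 489,600 macroblocks per second.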

Slide 22: H.264 CABAC (SA) Decoding
Targeted profile and level: 4.1 Main Profile
Bit-rate/stream considered: 35 Mbps (45 Mbps maximum)
Number of bins to decode using CABAC: 47M/sec
Number of clock cycles per bin: 1 cycle
Cycles to decode bins/stream: 50 MHz
Typical bit-rate expected for DVB: 10 Mbps
Cycles to decode bins for typical stream (DVB): 15 MHz

Slide 23: CA1024
[Figure: CA1024 block diagram]
Processors: Host CPU, Audio CPU, TS/Sec CPU, Video CPU, SA
ConnexArray™ Programmable Media Processor: Instruction Sequencer, I/O Controller, Switch Fabric
Functions listed: Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Graphics Processing, Video Merge/Blend, Motion Adaptive De-interlacing
Interface labels: HOST I/F (PCI v2.2 or Generic), Ext. Bus, Flash, 64-bit Wide DRAM, DDR-DRAM Ctrl (400 MHz Data Rate), Video In, Video Out, BT.656/1120, Audio In, Audio Out, 5x-I2S, 1xI2S, 2x-I2S or S/PDIF, S/PDIF, JTAG, GPIO, I2C, Test ICE

Slide 24: CA1024 Project Status
TSMC 0.13 micron
676-pin PBGA
Samples Q
[Figure labels: ACF, MIPS, PCI, MIPS, SA, DDR, CWOA, CA256]

Slide 25: In Summary
Fully programmable processor
Computational-intensive architecture
High-bandwidth I/O
Connex Programming Language & SDK
Die-area and power-efficient architecture

Slide 26: Thank You!