The CA1024: A Massively Parallel Processor for Cost-Effective HDTV


1 The CA1024: A Massively Parallel Processor for Cost-Effective HDTV
Connex Technology Proprietary and Confidential

2 Company Background
Fabless semiconductor company in Silicon Valley
VC funded (series A & B)
In the product-development stage with 26+ employees
Deep experience with video algorithms, processor design, and digital-video system software
Core asset: ConnexArray™ vector-processor architecture
Architecture verified in CA4096 test chip
Six patent applications on Connex vector-processor technology
1 US patent granted, 3 US patents pending, 2 US provisional
Granted and pending patents also filed in China, Taiwan, Korea, EEC, Japan, Singapore
Initial market focus on DTV

3 Presentation Agenda
Why a massively parallel processor (MPP)?
How is MPP integrated in an SoC?
Processor performance
Project status

4 Challenges
HDTV codec & post-processing are computationally intensive
Computation is dominated by data-parallel processes
HDTV is a fast-evolving domain
ASICs are a very costly solution

5 Our Solution: Integral Parallel Machine
Data-parallel computation
Time-parallel computation (supported by speculative parallelism)
I/O process is transparent to the computational process

6 Key Technology
Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
Simple programming model
Silicon-efficient architecture; die size competitive with similar-function ASICs
Re-use of transistors
Minimal dedicated hard-wired blocks
Sufficient performance to enable multistandard, multichannel, high-definition DTV
Linearly scalable

7 The Connex Architecture
[Figure: block diagram with a Sequencer, an I/O Controller, and an m × n Connex Array; each cell shows a 16-bit ALU, registers R0 to R7 (with Connex, I/O, and AUX labels), Select and Index, and a 16-bit RAM with addresses 0 to 255]
CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
3.2 GByte/sec I/O channel in parallel with code running on the Connex Array

8 Connex Cell Architecture
PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
Select flag enables or disables instruction processing
Index is a unique cell number used to direct certain instructions
Bidirectional 16-bit bus to 256 RAM locations
Connex register includes connections for shifts to/from adjacent PE
Aux and I/O registers dedicated to specific instruction functions
[Figure: one Connex cell with a 16-bit ALU, registers R0 to R7 (with Connex, I/O, and AUX labels), Select and Index, and a RAM with addresses 0 to 255]

9 ConnexArray Structure
Replicated Connex cells each include PE and local RAM
Linear interconnect of neighbor registers
Conditional execution based on state of select bit or index value
All selected cells execute the same instruction stream
[Figure: cells 0 to 1023 side by side, each with a 16-bit ALU, registers R0 to R7, and RAM addresses 0 to 255; the select flags shown are On, On, Off]
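The lockstep behavior described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not Connex code: the `Cell` class, the function names, the use of register 0 as the neighbor-connected register, the rule that cell 0 receives 0 on a shift, and the choice to ignore the select flag during shifts are all assumptions made for the example.

```python
from dataclasses import dataclass, field

NUM_CELLS = 8        # small stand-in for the 1,024 cells of the CA1024
MASK16 = 0xFFFF      # operands are 16 bits wide


@dataclass
class Cell:
    index: int                       # unique cell number
    select: bool = True              # select flag: True means "On"
    regs: list = field(default_factory=lambda: [0] * 8)   # R0..R7


def broadcast_add(cells, reg, value):
    """Every selected cell executes the same add; unselected cells do nothing."""
    for cell in cells:
        if cell.select:
            cell.regs[reg] = (cell.regs[reg] + value) & MASK16


def shift_right(cells, reg):
    """Each cell takes its left neighbor's value (assumed: cell 0 receives 0)."""
    old = [c.regs[reg] for c in cells]
    for cell in cells:
        cell.regs[reg] = old[cell.index - 1] if cell.index > 0 else 0


cells = [Cell(index=i) for i in range(NUM_CELLS)]
for c in cells:
    c.regs[0] = c.index * 10
    c.select = (c.index % 2 == 0)    # cells 0, 2, 4, 6 are "On"

broadcast_add(cells, reg=0, value=1)
after_add = [c.regs[0] for c in cells]

shift_right(cells, reg=0)
after_shift = [c.regs[0] for c in cells]

print(after_add)
print(after_shift)
```

Only the even-numbered cells change in the add step, which is the point of the select flag: one instruction stream, with each cell deciding locally whether to take part.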

10 Connex Data-Array Structure
[Figure: data array with elements 0 to 1023 across (Element n) and lines 0 to 255 down (Line m)]
16-bit data operands
256 lines with 1,024 16-bit elements per line
1GByte data I/O in parallel with computation operations
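As a quick worked check of the array dimensions on this slide (elements 0 to 1023, lines 0 to 255, 16-bit operands), the total on-array storage can be computed as below. The variable names are made up for the example, and the result is simple arithmetic on the slide's figures rather than a number stated in the presentation.

```python
LINES = 256                # lines 0..255, one per RAM address in each cell
ELEMENTS_PER_LINE = 1024   # elements 0..1023, one per cell
BITS_PER_ELEMENT = 16      # 16-bit data operands

line_bytes = ELEMENTS_PER_LINE * BITS_PER_ELEMENT // 8
total_bytes = LINES * line_bytes

print(f"one line : {line_bytes} bytes")
print(f"array    : {total_bytes} bytes ({total_bytes // 1024} KiB)")
```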

11 Full Line Operations: Operate On All Elements in Parallel
[Figure: data array (elements 0 to 1023, lines 0 to 255) in which Line i and Line j are combined with +, -, *, XOR, etc. to produce Line k]
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
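A minimal sketch of the two forms of line operation named here, written as ordinary Python rather than Connex code. The helper name `line_op` and the 16-bit wraparound on overflow are assumptions for illustration; the slide does not say how overflow is handled.

```python
import operator

WIDTH = 1024         # elements per line
MASK16 = 0xFFFF      # keep results within 16 bits (assumed wraparound)


def line_op(op, line_i, other):
    """Return Line i OP Line j, or Line i OP scalar when `other` is an int."""
    if isinstance(other, int):
        other = [other] * len(line_i)        # scalar repeated for all elements
    return [op(a, b) & MASK16 for a, b in zip(line_i, other)]


line_i = list(range(WIDTH))
line_j = [3] * WIDTH

line_k = line_op(operator.add, line_i, line_j)     # Line k = Line i + Line j
line_x = line_op(operator.xor, line_i, 0x00FF)     # Line k = Line i XOR scalar

print(line_k[:4], line_x[:4])
```

On the hardware described in the slides, each element of the result would be produced by its own cell in the same step; the list comprehension here only models the result, not the timing.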

12 Columns Active Based On Repeating Patterns
[Figure: data array (elements 0 to 1023, lines 0 to 255) with a regular pattern of active columns; Line i OP Line j = Line k, with OP one of +, -, *, XOR, etc.]
Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
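The repeating patterns in this example can be expressed as Boolean masks derived from the column index. The sketch below is an assumption-laden illustration: the function names are invented, columns are numbered from 0, and "every third and fourth column" is read as the third and fourth column in each group of four.

```python
WIDTH = 1024         # one column per cell, numbered 0..1023
MASK16 = 0xFFFF


def every_nth(n, phase):
    """Select columns whose index is congruent to `phase` modulo n."""
    return [col % n == phase for col in range(WIDTH)]


def masked_add(line, scalar, select):
    """Add `scalar` only in active columns; inactive columns keep their value."""
    return [(v + scalar) & MASK16 if on else v for v, on in zip(line, select)]


odd_columns = every_nth(2, 1)                                    # 1, 3, 5, ...
every_third = every_nth(3, 2)                                    # 2, 5, 8, ...
third_and_fourth = [col % 4 in (2, 3) for col in range(WIDTH)]   # 2, 3, 6, 7, ...

result = masked_add([0] * WIDTH, 5, odd_columns)
print(result[:8])
```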

13 Columns Active Based On Results of Previous Operations
[Figure: data array (elements 0 to 1023, lines 0 to 255) with an irregular set of active columns; Line i OP Line j = Line k, with OP one of +, -, *, XOR, etc.]
Example: Seemingly random columns are marked active, based on the data-dependent results of previous operations. This enables selective processing based on data content.
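Data-dependent selection can be illustrated with a clamp: a comparison produces the set of active columns, and the next operation touches only those columns. The values, the threshold, and the variable names below are invented for the example.

```python
THRESHOLD = 255

line = [12, 300, 7, 256, 255, 1024, 0, 999]

# Step 1: a comparison on the data decides which columns become active.
select = [value > THRESHOLD for value in line]

# Step 2: the following operation is applied only in the active columns.
clamped = [THRESHOLD if on else value for value, on in zip(line, select)]

print(select)
print(clamped)
```

The active columns follow the data rather than any index pattern, which is what makes the selection look random from the outside.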

14 Outer-Loop Parallelism
Program in context of 128+ data-structure instances
Example: 8x8 DCT
[Figure: a row of 8x8 blocks laid side by side across elements 0 to 1023, spanning eight lines (Line i to Line j) of the 256-line array]
Example: 128 sets of 8x8 run in parallel in a 1024-cell array
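The 128-block figure follows from 1,024 cells divided by 8 columns per block. The sketch below shows one plausible layout, assuming each block occupies 8 adjacent cells and 8 consecutive lines as the diagram suggests; the function names are hypothetical, and the per-line operation is a simple add standing in for a real DCT step.

```python
CELLS = 1024
BLOCK = 8
BLOCKS_IN_PARALLEL = CELLS // BLOCK      # 128 blocks side by side


def locate(block, row, col, first_line=0):
    """Map element (row, col) of a given block to (line, cell)."""
    return first_line + row, block * BLOCK + col


def add_to_all_blocks(array, value, first_line=0):
    """One pass over 8 lines updates every block; returns the line-op count."""
    line_ops = 0
    for row in range(BLOCK):
        line = array[first_line + row]
        array[first_line + row] = [(v + value) & 0xFFFF for v in line]
        line_ops += 1
    return line_ops


array = [[0] * CELLS for _ in range(BLOCK)]
ops = add_to_all_blocks(array, 1)
element_updates = BLOCKS_IN_PARALLEL * BLOCK * BLOCK

print(BLOCKS_IN_PARALLEL, ops, element_updates)
```

Eight line operations cover what a scalar loop would do as 8,192 element updates, which is the sense in which the outer loop over blocks disappears.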

15 I/O System
[Figure: I/O system block diagram showing the Connex Array with its I/O Plane, the IS and IOC blocks, interrupts, the Switch Fabric, and a DDR-DRAM Controller attached to four DRAMs]

16 Computational-Intensive Architecture
All forms of parallelism are strongly segregated
Connex Array for data-parallel computation
Speculative Array for time-parallel computation
The granularity perfectly fits the application domain
16-bit processing elements
No MACs, no FPUs, no multipliers…

17 High I/O Bandwidth
External I/O: 3.2 GBytes/sec
Serial access and random access with similar performance
Internal I/O: 400 GBytes/sec
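The two bandwidth figures can be related to other numbers in the deck with some simple arithmetic. The external figure matches a 64-bit DRAM interface at a 400 MHz data rate, as labeled on the CA1024 block diagram. For the internal figure, the sketch assumes every one of the 1,024 cells moves one 16-bit word per clock, which is an assumption rather than something the slides state, and derives the clock rate that assumption would imply.

```python
# External I/O: 64-bit wide DRAM interface at a 400 MHz data rate.
DDR_BUS_BITS = 64
DDR_TRANSFERS_PER_SEC = 400_000_000
external_bytes_per_sec = DDR_BUS_BITS // 8 * DDR_TRANSFERS_PER_SEC

# Internal I/O: assumed to be one 16-bit access per cell per clock.
CELLS = 1024
BYTES_PER_ACCESS = 2
INTERNAL_BYTES_PER_SEC = 400_000_000_000        # 400 GBytes/sec from the slide
implied_clock_hz = INTERNAL_BYTES_PER_SEC / (CELLS * BYTES_PER_ACCESS)

print(f"external: {external_bytes_per_sec / 1e9} GBytes/sec")
print(f"implied array clock: {implied_clock_hz / 1e6:.1f} MHz")
```

Under that assumption the implied clock is a little under 200 MHz, so the 400 GBytes/sec figure reads as a rounded aggregate of per-cell RAM accesses.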

18 Area & Power Efficiency
2 GOPS/mm² (peak performance)
GOPS/Watt is 25–50 times greater than that of a mature sequential technology

19 Programming Connex
CPL (Connex Programming Language) is an extension of C with C/C++ syntax
Code that operates on scalar data is written in regular C notation
Connex-specific operators defined for features not available in C, e.g. operations on vectors, selections
CPL uses sequential operators and control structures on vector and select datatypes
Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
Hides the complexities of the parallel execution hardware
Complete SDK

    {
        ...
        const short OFFSET = 15;
        ...
        short vector x, y;
        short vector min, max;
        sel = all;
        x += OFFSET;
        min = (x < y) ? x : y;
        max = (x > y) ? x : y;
    }

Vectors are arrays of scalar components. Selections are arrays of Boolean values that dictate which vector components are active.
Try doing the function in the short program segment shown above on a von Neumann machine! It takes orders of magnitude more clock cycles than the 2 clock cycles required by a Connex machine.
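For comparison, here is what the CPL segment computes when written out element by element in Python, the way a sequential machine would have to process it. This is a sketch of the semantics as the slide describes them, with `sel = all` taken to mean that every component is active; the function name and the sample values are made up, and 16-bit `short` overflow is not modeled.

```python
OFFSET = 15


def run_segment(x, y):
    """Element-wise equivalent of the CPL segment with all components active."""
    x = [xi + OFFSET for xi in x]                              # x += OFFSET;
    vmin = [xi if xi < yi else yi for xi, yi in zip(x, y)]     # min = (x < y) ? x : y;
    vmax = [xi if xi > yi else yi for xi, yi in zip(x, y)]     # max = (x > y) ? x : y;
    return x, vmin, vmax


x, vmin, vmax = run_segment([0, 10, 20, 30], [16, 16, 16, 16])
print(x, vmin, vmax)
```

Each list comprehension is a loop over every component, so the sequential cost grows with the vector length, whereas the slide's claim is that the array handles all components at once.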

20 Performance
DCT: 0.35 clock cycle per pixel
SAD: clock cycle per pixel

21 H.264 Dual HD Stream Decoding
Clock Cycles Per Macroblock
Dezigzagging            37.3
Intra Prediction        54.1
IT/IQ                   97.3
Motion Compensation    114.3
Deblocking Filter       27.1
Total                  337.8
Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
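The 409-cycle budget can be cross-checked against the macroblock rate of two 1080i streams. The sketch below assumes 1920x1088 coded frames (1080 lines padded to a multiple of 16), 16x16 macroblocks, and 30 frames per second per channel; none of these are stated on the slide. It then derives the clock rate the budget would imply and the headroom left by the stated total. Note that the five listed stages add up to about 330.1 cycles, slightly less than the stated total of 337.8, so the sketch uses the stated total as given.

```python
MB_SIZE = 16
MBS_PER_FRAME = (1920 // MB_SIZE) * (1088 // MB_SIZE)   # 120 * 68
FRAMES_PER_SEC = 30          # assumed 1080i rate: 60 fields, 30 frames
CHANNELS = 2

mbs_per_sec = MBS_PER_FRAME * FRAMES_PER_SEC * CHANNELS

BUDGET_CYCLES = 409          # allowed cycles per macroblock (from the slide)
STATED_TOTAL = 337.8         # total cycles per macroblock (from the slide)
LISTED_STAGES = [37.3, 54.1, 97.3, 114.3, 27.1]

implied_clock_hz = BUDGET_CYCLES * mbs_per_sec
headroom_cycles = BUDGET_CYCLES - STATED_TOTAL
utilization = STATED_TOTAL / BUDGET_CYCLES

print(f"macroblocks/sec : {mbs_per_sec}")
print(f"implied clock   : {implied_clock_hz / 1e6:.1f} MHz")
print(f"headroom        : {headroom_cycles:.1f} cycles ({1 - utilization:.1%})")
print(f"sum of stages   : {sum(LISTED_STAGES):.1f}")
```

Under these assumptions the budget corresponds to a clock of roughly 200 MHz, with a bit more than a sixth of each macroblock's cycle budget left unused.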

22 H.264 CABAC (SA) Decoding
Targeted profile and level: 4.1 Main Profile
Bit-rate/stream considered: 35 Mbps (45 Mbps maximum)
Number of bins to decode using CABAC: 47M/sec
Number of clock cycles per bin: 1 cycle
Cycles to decode bins/stream: 50 MHz
Typical bit-rate expected for DVB: 10 Mbps
Cycles to decode bins for typical stream (DVB): 15 MHz
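The cycle figures on this slide appear to be rounded up from the bin rate. A small worked check follows, assuming that the number of bins scales linearly with bit rate when going from the 35 Mbps case to the 10 Mbps DVB case; that scaling rule and the variable names are assumptions for the example, not something the slide spells out.

```python
BINS_PER_SEC_AT_35MBPS = 47_000_000
CYCLES_PER_BIN = 1
CONSIDERED_BITRATE = 35_000_000      # bits per second
TYPICAL_DVB_BITRATE = 10_000_000     # bits per second

# Cycles per second needed for the 35 Mbps stream.
cycles_considered = BINS_PER_SEC_AT_35MBPS * CYCLES_PER_BIN

# Scale the bin rate linearly with bit rate for the typical DVB stream.
bins_typical = BINS_PER_SEC_AT_35MBPS * TYPICAL_DVB_BITRATE / CONSIDERED_BITRATE
cycles_typical = bins_typical * CYCLES_PER_BIN

print(f"35 Mbps stream: {cycles_considered / 1e6:.1f} MHz (slide: 50 MHz)")
print(f"10 Mbps stream: {cycles_typical / 1e6:.1f} MHz (slide: 15 MHz)")
```

Both computed values fall just below the slide's figures, which is consistent with the slide rounding up to leave some margin.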

23 CA1024
[Figure: CA1024 block diagram. The ConnexArray™ Programmable Media Processor (Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Graphics Processing, Video Merge/Blend, Motion Adaptive De-interlacing) and its Instruction Sequencer connect through the Switch Fabric to the SA block, the Host CPU, TS/Sec CPU, Audio CPU, and Video CPU, a DDR-DRAM controller (64-bit wide DRAM, 400 MHz data rate), a host interface (PCI v2.2 or generic), an external bus controller with flash, video I/O (BT.656/1120), audio in and out (I2S or S/PDIF), and JTAG, Test ICE, GPIO, and I2C]

24 CA1024 Project Status
[Figure: chip layout with blocks labeled ACF, MIPS, PCI, SA, DDR, CWOA, and CA256]
TSMC 0.13 micron
676-pin PBGA
Samples Q3 2006

25 In Summary
Fully programmable processor
Computational-intensive architecture
High-bandwidth I/O
Connex Programming Language & SDK
Die-area and power-efficient architecture

26 Thank You!

