Connex Technology Proprietary and Confidential
The CA1024: A Massively Parallel Processor for Cost-Effective HDTV
Company Background
• Fabless semiconductor company in Silicon Valley
• VC funded (series A & B)
• In the product-development stage with 26+ employees
  – Deep experience with video algorithms, processor design, and digital-video system software
• Core asset: ConnexArray™ vector-processor architecture
  – Architecture verified in CA4096 test chip
• Six patent applications on Connex vector-processor technology
  – 1 US patent granted, 3 US patents pending, 2 US provisional
  – Granted and pending patents also filed in China, Taiwan, Korea, EEC, Japan, Singapore
• Initial market focus on DTV
Presentation Agenda
• Why a massively parallel processor (MPP)?
• How is the MPP integrated in an SoC?
• Processor performance
• Project status
Challenges
• HDTV codec and post-processing are computationally intensive
• Computation is dominated by data-parallel processes
• HDTV is a fast-evolving domain
• ASICs are a very costly solution
Our Solution: Integral Parallel Machine
• Data-parallel computation
• Time-parallel computation (supported by speculative parallelism)
• I/O process is transparent to the computational process
Key Technology
• Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
  – Simple programming model
• Silicon-efficient architecture; die size competitive with similar-function ASICs
  – Re-use of transistors
  – Minimal dedicated hard-wired blocks
• Sufficient performance to enable multistandard, multichannel, high-definition DTV
  – Linearly scalable
The Connex Architecture
• CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
• Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
• 3.2 GByte/sec I/O channel runs in parallel with code running on the Connex Array
[Figure: block diagram of the I/O Controller, Sequencer, and Connex Array. Labels: 16-bit ALU, registers R0 to R7, Connex, I/O, AUX, Select, Index, RAM address 0 to 255]
Connex Cell Architecture
• PE (Processing Element) has eight accumulator registers, including the Connex, Aux, and I/O special-function registers
• Select flag enables or disables instruction processing
• Index is a unique cell number used to direct certain instructions
• Bidirectional 16-bit bus to 256 RAM locations
• Connex register includes connections for shifts to/from adjacent PEs
• Aux and I/O registers are dedicated to specific instruction functions
[Figure: one Connex cell. Labels: 16-bit ALU, registers R0 to R7, Connex, I/O, AUX, Select, Index, RAM, Address 0]
ConnexArray Structure
• Replicated Connex cells, each including a PE and local RAM
• Linear interconnect of neighbor registers
• Conditional execution based on state of select bit or index value
• All selected cells execute the same instruction stream
[Figure: cells 0, 1, ..., 1023, each with a 16-bit ALU, registers R0 to R7, and RAM addressed 0 to 255; the select flag of each cell is shown as On or Off]
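The execution model on this slide (every selected cell runs the same instruction, and neighboring Connex registers are linked in a line) can be illustrated with a small emulation. This is a sketch in plain Python, not Connex code: the names `Cell`, `broadcast_add`, and `shift_right`, the 8-cell array size, the zero fill at the array edge, and the choice to shift regardless of the select flag are all assumptions made for illustration.

```python
from dataclasses import dataclass

MASK16 = 0xFFFF  # cells hold 16-bit values


@dataclass
class Cell:
    index: int            # unique cell number
    select: bool = True   # select flag: the cell executes only when set
    acc: int = 0          # one accumulator register
    connex: int = 0       # register linked to the neighboring cells


def broadcast_add(cells, value):
    """Every selected cell adds `value` to its accumulator; the rest stay idle."""
    for c in cells:
        if c.select:
            c.acc = (c.acc + value) & MASK16


def shift_right(cells, fill=0):
    """Move each Connex register one cell to the right (edge fill is assumed)."""
    previous = fill
    for c in cells:
        previous, c.connex = c.connex, previous


cells = [Cell(index=i, connex=i * 10) for i in range(8)]
for c in cells:
    c.select = c.index % 2 == 1   # select by index: odd cells only
broadcast_add(cells, 5)
shift_right(cells)

accs = [c.acc for c in cells]       # only odd cells were updated
links = [c.connex for c in cells]   # values moved one cell to the right
```

Only the odd-numbered cells change their accumulator, while the shift moves every Connex value one position along the linear interconnect.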
Connex Data-Array Structure
• 16-bit data operands
• 256 lines with bit elements per line
• 1 GByte data I/O in parallel with computation operations
[Figure: data array with axes labeled Element n and Line m]
Full Line Operations: Operate on All Elements in Parallel
• Line k = Line i OP Line j
• Line k = Line i OP scalar value (repeated for all elements)
• OP is +, -, *, XOR, etc.
[Figure: Line i OP Line j = Line k]
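The two forms of full-line operation can be sketched as element-wise list operations. This is an illustrative Python emulation, not CPL; the helper name `line_op`, the 4-element line width, and the wrap-around to 16 bits on overflow are assumptions.

```python
import operator

MASK16 = 0xFFFF  # operands are 16-bit; wrap-around on overflow is assumed


def line_op(op, line_i, other):
    """Apply `op` element-wise; a scalar `other` is repeated for all elements."""
    if isinstance(other, int):
        other = [other] * len(line_i)
    return [op(a, b) & MASK16 for a, b in zip(line_i, other)]


line_i = [1, 2, 3, 4]
line_j = [10, 20, 30, 40]

line_k = line_op(operator.add, line_i, line_j)   # Line k = Line i + Line j
scaled = line_op(operator.mul, line_i, 3)        # Line k = Line i * scalar
mixed = line_op(operator.xor, line_i, line_j)    # Line k = Line i XOR Line j
```

On the hardware every element would be processed in the same cycle; the Python loop only models the result, not the timing.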
Columns Active Based on Repeating Patterns
• Example: mark all odd columns active, or every third column, or every third and fourth column, etc.
[Figure: Line i OP Line j = Line k (OP is +, -, *, XOR, etc.), applied only in the active columns]
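Pattern-based selection can be modeled as a Boolean mask computed from the column index. The sketch below is plain Python with hypothetical helper names (`select_pattern`, `masked_add`); it assumes 0-based column numbering and that inactive columns keep their previous Line k value.

```python
def select_pattern(width, period, active_phases):
    """Mark column c active when (c mod period) is in active_phases."""
    return [(c % period) in active_phases for c in range(width)]


def masked_add(line_i, line_j, line_k, active):
    """Line k = Line i + Line j in active columns; other columns keep Line k."""
    return [(a + b) & 0xFFFF if on else k
            for a, b, k, on in zip(line_i, line_j, line_k, active)]


odd = select_pattern(8, 2, {1})              # columns 1, 3, 5, 7
every_third = select_pattern(8, 3, {2})      # columns 2, 5
third_and_fourth = select_pattern(8, 4, {2, 3})  # columns 2, 3, 6, 7

line_i = [1] * 8
line_j = [2] * 8
line_k = [0] * 8
result = masked_add(line_i, line_j, line_k, odd)
```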
Columns Active Based on Results of Previous Operations
• Example: an apparently random set of columns is marked active, based on data-dependent results of previous operations
• This enables selective processing based on data content
[Figure: Line i OP Line j = Line k (OP is +, -, *, XOR, etc.), applied only in the marked columns]
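Data-dependent selection follows the same masking idea, except that the mask comes from a comparison on the data itself. A minimal Python sketch, with hypothetical helpers `where` and `masked_op` and made-up sample values:

```python
def where(line, predicate):
    """Build a selection from data: True where the predicate holds."""
    return [predicate(v) for v in line]


def masked_op(op, line, active):
    """Apply `op` only in active columns; other columns are left unchanged."""
    return [op(v) & 0xFFFF if on else v for v, on in zip(line, active)]


pixels = [12, 250, 199, 7, 64, 30, 255, 180]   # illustrative data only

bright = where(pixels, lambda v: v > 128)      # selection depends on content
clamped = masked_op(lambda v: 128, pixels, bright)
```

Here the first operation (the comparison) decides which columns take part in the second (the clamp), which is the selective processing the slide describes.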
Outer-Loop Parallelism: Program in Context of 128+ Data-Structure Instances
• Example: 8x8 DCT
• Example: 128 sets of 8x8 run in parallel in a 1024-cell array
[Figure: 8x8 blocks placed side by side across Line i and Line j]
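The layout behind this slide can be sketched as follows: 128 blocks of 8x8 sit side by side, so each of 8 lines holds one row from every block, and one line operation touches all blocks at once. The Python below is an illustration under assumptions: the packing order, the helper names, and the per-block computation (column sums, chosen for simplicity and not a DCT) are not taken from the deck.

```python
BLOCK = 8
CELLS = 1024
NUM_BLOCKS = CELLS // BLOCK   # 128 blocks side by side


def pack_blocks(blocks):
    """Lay blocks out so block b, row r, column c sits at lines[r][b*8 + c]."""
    lines = [[0] * CELLS for _ in range(BLOCK)]
    for b, block in enumerate(blocks):
        for r in range(BLOCK):
            for c in range(BLOCK):
                lines[r][b * BLOCK + c] = block[r][c]
    return lines


def column_sums(lines):
    """Seven full-line additions yield the column sums of every block at once."""
    total = list(lines[0])
    for line in lines[1:]:
        total = [(t + v) & 0xFFFF for t, v in zip(total, line)]
    return total


# Block b is filled with the constant b, so each of its column sums is 8 * b.
blocks = [[[b] * BLOCK for _ in range(BLOCK)] for b in range(NUM_BLOCKS)]
sums = column_sums(pack_blocks(blocks))
```

The program is written once, as if for a single 8x8 block, and the array width supplies the outer loop over the 128 instances.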
I/O System
[Figure: I/O system block diagram. Labels: Connex Array, I/O Plane, IOC, IS, Interrupts, Switch Fabric, DDR-DRAM Controller, DRAM]
Computational-Intensive Architecture
• All forms of parallelism are strongly segregated
  – Connex Array for data-parallel computation
  – Speculative Array for time-parallel computation
• The granularity perfectly fits the application domain
  – 16-bit processing elements
  – No MACs, no FPUs, no multipliers…
High I/O Bandwidth
• External I/O: 3.2 GBytes/sec
  – Serial access and random access with similar performance
• Internal I/O: 400 GBytes/sec
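The external figure is consistent with the DRAM interface shown on the SoC block diagram (64-bit wide DRAM at a 400 MHz data rate). The short calculation below checks that, and then shows what clock the internal figure would imply under one possible reading. That reading, one 16-bit word per processing element per cycle, is an assumption and is not stated in the deck.

```python
DRAM_BUS_BITS = 64            # "64-bit wide DRAM" on the SoC block diagram
DRAM_DATA_RATE_HZ = 400e6     # "400 MHz data rate" on the SoC block diagram

# External bandwidth: 8 bytes per transfer times 400 million transfers/sec.
external_bytes_per_sec = DRAM_BUS_BITS // 8 * DRAM_DATA_RATE_HZ

# ASSUMPTION (not from the deck): internal bandwidth counts one 16-bit word
# moved by each of the 1,024 processing elements on every clock cycle.
NUM_PES = 1024
BYTES_PER_PE = 2
INTERNAL_BYTES_PER_SEC = 400e9   # quoted internal I/O figure

implied_clock_hz = INTERNAL_BYTES_PER_SEC / (NUM_PES * BYTES_PER_PE)
```

Under that assumption the implied clock is a little under 200 MHz; a different accounting of internal transfers would give a different number.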
Area & Power Efficiency
• 2 GOPS/mm² (peak performance)
• GOPS/Watt is 25–50 times greater than a mature sequential technology
Programming Connex
• CPL (Connex Programming Language) is an extension of C with C/C++ syntax
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators are defined for features not available in C, e.g. operations on vectors and selections
• CPL uses sequential operators and control structures on vector and select datatypes
• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
• Hides the complexities of the parallel execution hardware
• Complete SDK

    {
        ...
        const short OFFSET = 15;
        ...
        short vector x, y;
        short vector min, max;
        ...
        sel = all;
        x += OFFSET;
        ...
        min = (x < y) ? x : y;
        max = (x > y) ? x : y;
        ...
    }

Vectors are arrays of scalar components. Selections are arrays of Boolean values that dictate which vector components are active.
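For readers without the CPL toolchain, the semantics of the fragment above can be approximated in plain Python, treating a vector as a list and a selection as a list of Booleans. This is only an emulation of what the CPL statements appear to mean; the sample values and the 4-element width are made up, and `vmin`/`vmax` stand in for the CPL variables `min` and `max`.

```python
OFFSET = 15

x = [0, 10, 20, 30]      # illustrative vector contents
y = [18, 18, 18, 18]

sel = [True] * len(x)                                    # sel = all;
x = [v + OFFSET if on else v for v, on in zip(x, sel)]   # x += OFFSET;
vmin = [a if a < b else b for a, b in zip(x, y)]         # min = (x < y) ? x : y;
vmax = [a if a > b else b for a, b in zip(x, y)]         # max = (x > y) ? x : y;
```

Each CPL statement reads like scalar C, yet applies to every active component of the vector.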
Performance
• DCT: 0.35 clock cycles per pixel
• SAD: clock cycle per pixel
H.264 Dual HD Stream Decoding: Clock Cycles per Macroblock

Stage                  Clock cycles/macroblock
Dezigzagging           37.3
Intra Prediction       54.1
IT/IQ                  97.3
Motion Compensation
Deblocking Filter      27.1
Total

Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
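The 409-cycle budget can be reproduced from standard 1080i parameters if a 200 MHz clock is assumed. The clock frequency is an assumption, since the slide does not state it; the frame size, macroblock size, and 29.97 frames/s rate are the usual values for 1080i H.264. The stage sum below covers only the four stages whose values are present on the slide.

```python
MB_SIZE = 16                       # H.264 macroblocks are 16x16 pixels
WIDTH, CODED_HEIGHT = 1920, 1088   # 1080 lines padded to a multiple of 16
FRAME_RATE = 30000 / 1001          # 29.97 frames/s (59.94 fields/s)
CHANNELS = 2                       # dual HD stream
CLOCK_HZ = 200e6                   # ASSUMPTION: clock is not given on the slide

mbs_per_frame = (WIDTH // MB_SIZE) * (CODED_HEIGHT // MB_SIZE)   # 120 * 68
mbs_per_sec = mbs_per_frame * FRAME_RATE * CHANNELS
budget = CLOCK_HZ / mbs_per_sec    # cycles available per macroblock

# Sum of the stage costs that are present on the slide (two entries missing).
known_stage_cycles = 37.3 + 54.1 + 97.3 + 27.1
headroom = round(budget) - known_stage_cycles
```

With these inputs the budget rounds to 409 cycles, and the four listed stages leave roughly 193 cycles for the entries whose values are missing.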
H.264 CABAC (SA) Decoding
• Targeted profile and level: Main Profile, Level 4.1
• Bit rate per stream considered: 35 Mbps (45 Mbps maximum)
• Number of bins to decode using CABAC: 47M/sec
• Number of clock cycles per bin: 1 cycle
• Cycles to decode bins per stream: 50 MHz
• Typical bit rate expected for DVB: 10 Mbps
• Cycles to decode bins for typical stream (DVB): 15 MHz
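The clock figures on this slide follow from the bin rate and the one-cycle-per-bin cost. The calculation below retraces them; the assumption that the bin count scales linearly with bit rate is mine, used here only to relate the 35 Mbps and 10 Mbps cases.

```python
BINS_PER_SEC = 47e6       # bins to decode at the 35 Mbps operating point
BITRATE_BPS = 35e6
CYCLES_PER_BIN = 1

# Cycles per second needed for one stream; the slide quotes this as 50 MHz.
cycles_per_sec = BINS_PER_SEC * CYCLES_PER_BIN

# ASSUMPTION: the number of bins grows in proportion to the bit rate.
bins_per_bit = BINS_PER_SEC / BITRATE_BPS

DVB_BITRATE_BPS = 10e6
# Cycles per second for a typical DVB stream; the slide quotes 15 MHz.
dvb_cycles_per_sec = DVB_BITRATE_BPS * bins_per_bit * CYCLES_PER_BIN
```

Both computed values (47 MHz and about 13.4 MHz) sit slightly below the slide's figures, which suggests the slide rounds up to leave some margin.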
CA1024
[Figure: CA1024 SoC block diagram]
• Processors: Host CPU, Audio CPU, TS/Sec CPU, Video CPU, SA
• ConnexArray™ programmable media processor with Instruction Sequencer and I/O Controller, attached to the Switch Fabric
• Media-processor functions: multi-codec processing, pre-analysis, 3D filter, scaling, graphics processing, video merge/blend, motion-adaptive de-interlacing
• Memory: DDR-DRAM controller (400 MHz data rate), 64-bit wide DRAM, Flash
• Interfaces: host I/F (PCI v2.2 or generic), external bus, video in and video out (BT.656/1120), audio in and audio out (I2S or S/PDIF), I2C, GPIO, JTAG, test ICE
CA1024 Project Status
• TSMC 0.13 micron
• 676-pin PBGA
• Samples Q
[Figure: chip floorplan with blocks labeled ACF, MIPS, PCI, MIPS, SA, DDR, CWOA, CA256]
In Summary
• Fully programmable processor
• Computational-intensive architecture
• High-bandwidth I/O
• Connex Programming Language & SDK
• Die-area and power-efficient architecture
Thank You!