Introducing the ConnX D2 DSP Engine
Introduced: August 24, 2009
Fastest Growing Processor / DSP IP Company
Customizable Dataplane Processor/DSP IP Licensing
- Leading provider of customizable Dataplane Processor Units (DPUs)
- Unique combination of processor & DSP IP cores + software design tools
- Customization enables improved power, cost, performance
- Standard DPU solutions for audio, video/imaging & baseband comms
- Dominant patent portfolio for configurable processor technology

Broad-Based Success
- 150+ licensees, including 5 of the top 10 semiconductor companies
- Shipping in high volume today (>200M/yr rate)
- Fastest growing semiconductor processor IP company (per Gartner, Jan-09)
- 21% revenue growth in 2007, 25% in 2008
Focus: Dataplane Processing Units (DPUs)
DPUs: customizable CPU+DSP delivering 10 to 100x higher performance than a CPU or DSP and providing better flexibility & verification than RTL.
[Slide diagram: a conventional CPU serves as the embedded controller for main applications; Tensilica's focus is dataplane processors for dataplane processing.]
Maintenance and flexibility push DSP algorithms towards C-code
Communications DSP Trends / Challenges

Code Size Increases
- Communications standards growing in number & complexity
- DSP algorithm code heavily integrated with more (and more complex) control code
- Maintenance and flexibility push DSP algorithms towards C-code

Development Teams Shrink
- SOC development schedules tightening
- Tightening resource constraints (do more with less)

Markets Changing Faster
- Market requirements in flux as the economy wobbles
- Emerging standards evolve faster in the Internet age
Trends Within Licensable DSP Architectures
1st Generation Licensable DSP Cores
- Modest/medium performance (single/dual MAC)
- Simple architecture (single issue, compound instructions)
- Limited or no compiler support (mostly hand-coded)

2nd Generation Licensable DSP Cores
- Added RISC-like architecture features (register arrays)
- Improved compiler targets, but still largely assembly-programmed
- Some offer wide VLIW for performance: large area, code bloat
- Some offer wide SIMD for performance: good area/performance tradeoff, but no performance gain when vectorization fails
Vectorization Benefits (SIMD)
- Loop counts can be reduced
- Data computation can be done in parallel
- Cheapest (hardware cost) method to get higher performance

Example: 2-way SIMD performance benefit
[Slide diagram: before vectorization, Data0-Data7 are processed one element per execution step; after vectorization, 2-way SIMD execution processes two elements (Data0/Data1, Data2/Data3, ...) per step.]
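To make the 2-way case above concrete, here is a plain-C sketch (illustrative only; these are not ConnX D2 intrinsics and the function names are made up): a vectorizing compiler effectively halves the trip count by processing two 16-bit elements per iteration.

#include <stddef.h>

/* Scalar form: one 16-bit multiply-accumulate per iteration. */
int sum_sq_scalar(const short *x, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

/* What 2-way vectorization amounts to: two elements per iteration,
   so the loop count is halved (n assumed even for brevity). */
int sum_sq_2way(const short *x, size_t n)
{
    int acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc0 += x[i]     * x[i];      /* lane 0 */
        acc1 += x[i + 1] * x[i + 1];  /* lane 1 */
    }
    return acc0 + acc1;
}

On real SIMD hardware the two lanes execute in a single instruction rather than as two scalar statements.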
VLIW Technology: Parallel execution of instructions
- Effective use of multiple ALUs/MACs
- Compiler allocates instructions to VLIW slots
- Orthogonal allocation yields more flexibility
[Slide diagram: with a single execution ALU, instructions #1-#4 issue one at a time; with two VLIW execution ALUs, instruction pairs #1/#2 and #3/#4 issue in parallel.]
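As a small, generic C illustration (not ConnX-specific) of the independence a VLIW compiler exploits: the two accumulations below have no data dependence on each other, so a VLIW scheduler can assign them to different execution slots and issue them in the same cycle.

/* Two independent multiply-accumulate chains: a VLIW compiler can place
   each chain in a different ALU/MAC slot and issue both in one cycle.   */
void dual_chain(const short *a, const short *b, int n, int *out0, int *out1)
{
    int acc0 = 0, acc1 = 0;
    for (int i = 0; i < n; i++) {
        acc0 += a[i] * b[i];   /* chain 0: candidate for slot 0 */
        acc1 += a[i] * a[i];   /* chain 1: candidate for slot 1 */
    }
    *out0 = acc0;
    *out1 = acc1;
}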
Ideal 3rd Generation Licensable DSP
Ideal Characteristics
- VLIW capability for good performance on general code: parallelization of independent operations
- SIMD capability for good performance on loop code: data-parallel execution
- Good C compiler target: reduces or eliminates the need for assembly programming; productivity benefit
- Small, compact size: keeps costs down in brutally competitive markets
Tensilica - the Stealth DSP Company
[Product map: Xtensa DSP building blocks spanning comms, audio, video and other markets - from the single-MAC MAC16 and dual-MAC ConnX D2 (with MUL32 and DIV32 options), HiFi 2 audio, 388VDO video (8 MAC), quad-MAC ConnX Vectra LX, 8-MAC ConnX 545CK DSP, and the 16-MAC ConnX BBE and more, plus single- and double-precision floating-point hardware acceleration and custom DSPs built with Xtensa TIE.]
ConnX D2 DSP Engine - Overview
Dual 16-bit MAC Architecture with Hybrid SIMD / VLIW
- Optimum performance on a wide range of algorithms
- SIMD offers a high data-computation rate for DSP algorithms
- 2-way VLIW allows parallel instruction execution on SIMD and scalar code

"Out of the Box" industry-standard software compatibility
- TI C6x fixed-point C intrinsics supported, fully bit-for-bit equivalent with TI C6x
- ITU reference code fixed-point C intrinsics directly supported

Goals: Ease of Use, Low Area/Cost
- Click-and-go "out of the box" performance from standard C code
- Standard C and fixed-point data types: 16-bit, 32-bit and 40-bit
- Advanced optimizing, vectorizing compiler
- Less than 70K gates (under 0.2 mm2 in 65nm)

The ConnX DSP architecture is designed for ease of use. Within the Xplorer tool, the ConnX architecture is added simply by clicking the function within the GUI. Integration and optimization within the Xtensa hardware is automatic. The compiler and software tool chain is also configured automatically and is not visible to the user. The compiler will automatically identify …
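To illustrate the kind of code this compatibility claim covers, here is a hypothetical fixed-point dot product written with the TI C6x _smpy/_sadd intrinsics; the fallback definitions are reference-style sketches added here only so the snippet compiles outside the TI toolchain, and are not part of the slide.

#include <stdint.h>

/* Reference-style fallbacks so this compiles outside the TI toolchain;
   with the TI C6000 compiler the real _smpy/_sadd intrinsics map to
   single saturating hardware operations.                              */
#ifndef _TMS320C6X
static int32_t _smpy(int32_t a, int32_t b)   /* saturating Q15 multiply */
{
    int64_t p = (int64_t)(int16_t)a * (int16_t)b * 2;
    if (p > INT32_MAX) return INT32_MAX;
    if (p < INT32_MIN) return INT32_MIN;
    return (int32_t)p;
}
static int32_t _sadd(int32_t a, int32_t b)   /* saturating 32-bit add   */
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}
#endif

/* Hypothetical fixed-point dot product in TI C6x intrinsic style,
   the kind of source the slide says runs unmodified on ConnX D2.  */
int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = _sadd(acc, _smpy(a[i], b[i]));  /* saturating MAC */
    return acc;
}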
Target Applications: ConnX D2
General-purpose 16-bit DSP for a wide range of applications
- Embedded control
- VoIP gateways, voice-over-networks (including VoIP codecs)
- Femto-cell and pico-cell base stations
- Next-generation disk drives, data storage
- Mobile terminals and handsets
- Home entertainment devices
- Computer peripherals, printers
ConnX D2 DSP: An ingredient of an Xtensa DPU
Hardware Use Model
- Click-button configuration option within the Xtensa LX core
- Part of the Tensilica configurable core deliverable package

Two reference configurations
- Typical DSP solution for high performance
- Small size for cost- and power-sensitive applications

Full tool support from Tensilica
- High-level simulators (SystemC), ISS and RTL
- Debugger and trace
- Compiler, IDE and operating systems
ConnX D2 Processor Block Diagram (Typical)
ConnX D2 Engine Architecture
[Block diagram: a load/store unit connects local memory and/or cache (32-bit paths) to the 32-bit AR register bank (used for addresses), the XDD register file (8 x 40 bits, used for data), and the XDU alignment registers (4 x 32 bits); overflow and carry state are maintained; supported data types include 40-, 32- and 16-bit integer and fixed-point values and 16-bit real/imaginary vectors with Hi/Lo 16-bit select.]

DSP-specific instructions
- Add-Bit-Reverse-Base and Add-Subtract: useful for FFT implementation
- Add-Compare-Exchange: useful for Viterbi implementation
- Add-Modulo: circular-buffer implementation, useful for FIR implementation

Addressing Modes
- Immediate / immediate updating
- Indexed / indexed updating
- Aligning updating
- Circular (instruction)
- Bit-reversed (instruction)
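As a plain-C sketch of why the Add-Modulo instruction matters for FIR filters (illustrative code, not ConnX D2 intrinsics): the per-tap modulo index update below is exactly the circular-buffer wrap that Add-Modulo performs in a single instruction.

/* Circular-buffer FIR tap loop; the modulo update of 'pos' is the
   operation a hardware Add-Modulo instruction folds into one cycle. */
short fir_circular(const short *coef, const short *delay, int pos, int taps, int len)
{
    long acc = 0;
    for (int k = 0; k < taps; k++) {
        acc += (long)coef[k] * delay[pos];
        pos = (pos + 1) % len;            /* circular wrap = Add-Modulo */
    }
    return (short)(acc >> 15);            /* Q15 scaling of the result  */
}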
ConnX D2: Instruction Allocation Options
[Slide diagram: 16-bit instructions carry the base ISA; 24-bit instructions carry the base ISA or ConnX D2; 64-bit VLIW instructions provide Slot 0 (ConnX D2 or base ISA) and Slot 1 (ConnX D2 or base ISA register moves & C ops on register data).]
- Flexible allocation of instructions available to the compiler
- Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
- Improved performance and no code bloat (reduced NOPs)
- Reduced code size when the algorithm is less performance-intensive
- Modeless switching between instruction formats
ConnX D2 : SIMD with VLIW – Extra Performance
Combining SIMD and VLIW can give 6 times the performance.
Example: energy calculation, A = Σ (n = 0..127) Xn * Xn, a 128-iteration C loop.

Base Xtensa configuration (scalar instruction execution, 416 cycles):
loopgtz a3, .LBB52_energy
  l16si   a3, a2, 2
  l16si   a5, a2, 4
  l16si   a6, a2, 6
  l16si   a7, a2, 8
  mul16s  a3, a3, a3
  mul16s  a5, a5, a5
  mul16s  a6, a6, a6
  mul16s  a7, a7, a7
  addi.n  a2, a2, 8
  add.n   a3, a4, a3
  add.n   a3, a3, a5
  add.n   a3, a3, a6
  add.n   a4, a3, a7

ConnX D2 (64 cycles): one 64-bit VLIW instruction per iteration (Slot 0 ; Slot 1)
loop {  # format XD2_FLIX_FORMAT
  xd2_la.d16x2s.iu xdd0, xdu0, a4, 4 ; xd2_mulaa40.d16s.ll.hh xdd1, xdd0, xdd0
}

- Vectorization (SIMD) doubles the data-computation performance
- VLIW gives two pipeline executions (one of them SIMD) with auto-increment loads
- The ConnX D2 architecture delivers this combination and performance
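The slide shows only the generated code; a minimal sketch of the 128-point energy loop it presumably starts from (the exact benchmark source is not given, so the name is illustrative) would look like this:

/* 128-point energy: sum of squares of 16-bit samples. */
int energy_128(const short *x)
{
    int sum = 0;
    for (int n = 0; n < 128; n++)
        sum += x[n] * x[n];   /* 16x16 multiply, accumulated in 32/40 bits */
    return sum;
}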
When Vectorization is Not Possible: Performance for scalar code bases
int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i + col] * a[cols*i + col];
    }
    return sum;
}

- Energy computation of column 'col' in a 2-D array
- The loop above cannot be vectorized: non-contiguous memory accesses thwart vectorizers
- Regular compilers cannot map this code onto traditional SIMD DSPs
When Vectorization is Not Possible: Performance for scalar code bases
int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i + col] * a[cols*i + col];
    }
    return sum;
}

- Confirmed that neither the ConnX D2 nor the TI C6x compiler can vectorize this code
- The ConnX D2 compiler can, however, use VLIW to increase performance

Generated assembly code:
entry    a1, 32
blti     a5, 1, .Lt_0_2306
addx2    a2, a3, a2
slli     a3, a4, 1
addi.n   a4, a5, -1
sub      a2, a2, a3
{  # format XD2_FLIX_FORMAT
  xd2_l.d16s.xu xdd0, a2, a3 ; xd2_movi.d40 xdd1, 0
}
loopgtz  a4, .LBB43_energy
{  # format XD2_FLIX_FORMAT
  xd2_l.d16s.xu xdd0, a2, a3 ; xd2_mula32.d16s.ll_s1 xdd1, xdd0, xdd0
}
…

ConnX D2: one cycle within the loop
- Load scalar 16 bits: xdd0 is loaded with the memory contents addressed by register a2, and a2 is then updated by the value in a3
- MAC operation on the lower 16 bits: multiplies xdd0 by xdd0 and accumulates the result into xdd1
Optimization with ITU / TI Intrinsics: Performance for generic code bases
Energy calculation loop: 1000 iterations, using the L_mac ITU intrinsic

#define ASIZE 1000
extern int a[ASIZE];
extern int red;

void energy()
{
    int i;
    int red_0 = red;
    for (i = 0; i < ASIZE; i++) {
        red_0 = L_mac(red_0, a[i], a[i]);
    }
    red = red_0;
}

- L_mac maps to one ConnX D2 instruction
- The compiler further optimizes by using SIMD to accelerate the loop
- VLIW allows further acceleration with parallel loads
- The 1000-iteration C loop is optimized to a 500-cycle loop, sustaining 3 operations per cycle

Generated assembly code:
entry    a1, 32
l32r     a2, .LC1_40_18
l32r     a5, .LC0_40_17
xd2_l.d16x2s.iu xdd0, a2, 4          # test_arr_1+0x0
l32i.n   a3, a5, 0                   # test_global_red_0+0x0
{  # format XD2_ARUSEDEF_FORMAT
  xd2_mov.d32.a32s xdd1, a3 ; movi a3, 499
}
loopgtz  a3, …
{  # format XD2_FLIX_FORMAT
  xd2_l.d16x2s.iu xdd0, a2, 4 ; xd2_mulaa.fs32.d16s.ll.hh xdd1, xdd0, xdd0
}
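For readers unfamiliar with the ITU basic operators, here is a reference-style definition of L_mac in the spirit of the ITU-T STL basicop code (overflow-flag handling omitted); it is shown only to clarify what the compiler maps onto a single ConnX D2 MAC instruction, and is not the ConnX D2 implementation.

#include <stdint.h>

/* Saturate a 64-bit intermediate to 32 bits. */
static int32_t saturate32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* L_mult: fractional 16x16 -> 32 multiply (Q15 x Q15 -> Q31) with saturation. */
static int32_t L_mult(int16_t a, int16_t b)
{
    return saturate32((int64_t)a * b * 2);
}

/* L_mac: saturating multiply-accumulate, acc + L_mult(a, b). */
static int32_t L_mac(int32_t acc, int16_t a, int16_t b)
{
    return saturate32((int64_t)acc + L_mult(a, b));
}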
“Out of the Box” Performance - Results
Comparison to TI C55x (an industry benchmark: dual-MAC, 2-way VLIW)
- 20% more performance (256-point complex FFT)

  Cycle count, 256-point complex FFT (lower is better) #
  ConnX D2, "out of the box" C code : 3740
  TI C55x, optimized assembly       : 4786

- Why better? FFT-specific instructions, dual write to register files, advanced compiler, SIMD and VLIW performance

Comparison to other DSP IP vendors
- Almost twice the performance

  Required MHz for AMR-NB (VAD2) encode + decode
  ConnX D2 (out-of-the-box ITU reference code)   : 27.7 MHz
  CEVA-X1620 (out-of-the-box ITU reference code) : 48 MHz *

- Why better? 1-to-1 mapping of ITU intrinsics, SIMD and VLIW performance, flexibility in VLIW allocation, VLIW performance for scalar code

* From CEVA published whitepaper    # Dec 2008
Small, Low Power, & High Performance
Optimized for low area / low cost applications
- Less than 70,000 gates; 0.18 mm2 in 65nm GP *

Low power
- 52 uW/MHz power consumption (65nm GP, measured running the AMR-NB algorithm)

Very high performance
- 600 MHz in 65nm GP **

* After full place and route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option.
** After full place and route, when optimized for speed.
Flexible and Customizable
Configure memory subsystems to exact requirements
- Up to 4 local memories: instruction memory, data memory, RAM and ROM options
- DMA path into these memories
- Instruction and data cache configurations
- MMU and memory-region protection
- Memory port interface
- Option of dual load/store architecture

Full customization
- Instruction set extensions
- Custom I/O interfaces: TIE ports, queues and lookup memory interfaces
ConnX D2 DSP Engine: Summary
- Small size, low power
- Excellent performance on a wide range of code
- Easy to use: C-programming centric
- "Out of the box" performance: reduced development time, reduced cost
- ITU and TI C intrinsic support: large existing code base
- Bit-equivalent to TI C6x: take current TI code, port it, and get the same functionality on ConnX D2
- Flexible & customizable