1 Analog Devices TigerSHARC® DSP Family Presented By: Mike Lee and Mike Demcoe Date: April 8 th, 2002
2 TigerSHARC Architectural Overview High performance, 128-bit successor to the ADSP-2106x SHARC family ADSP-TS101S, the newest TigerSHARC DSP, operates at 250MHz! Multiple computational units Two compute blocks, each containing a register file, ALU, multiplier, and shifter. Two additional integer ALUs Two hardware loop counter registers Can execute up to four independent 32-bit instructions at a time Or, eight 16-bit instructions Very wide word widths for high precision arithmetic Designed to be used in a multiple processor environment
3 TigerSHARC Architecture Overview (cont…) BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline 32-instruction, 4-way set-associative cache User controlled Branch Prediction Three, 128-bit blocks of memory which provide access to a program and two data operands without causing instruction/data conflicts. Load-store, Harvard architecture, like SHARC. Native support for complex number instructions
4 The TS101S Architecture
5 Details of Multiple Compute Blocks Two computational units, each containing: Register file – Multi-ported to allow multiple accesses to registers in a single clock cycle General purpose registers! Contains 32 words, each word being 32-bits in length. ALU – Fixed-point and floating point Multiplier – Fixed-point and floating point Also features MAC (multiply-and-accumulate) capabilities Shifter – Standard logical and arithmetic shifts as well as bit manipulation
6 The TS101S Pipeline Fetch 1 Fetch 2 Fetch 3 Integer Access Execute 1 Execute 2 Decode IAB Fetch Stages Execution Stages
7 Pipelines and Instruction Related Information ADSP Three stage pipeline 20ns instruction cycle SISD but can put instructions in parallel ADSP-TS101S Eight stage pipeline with IAB 4ns instruction cycle MIMD and can also put instructions in parallel
8 Loops, Branching and Timers ADSP Zero-overhead hardware loop support Delayed Branching One timer ADSP-TS101S Little support for zero-overhead hardware loops 32-entry 4-way associative BTB cache with Branch prediction Two timers
9 Memory and Buses ADSP 1 Mbit dual ported SRAM Shared by three buses (PM, DM, I/O) PM and DM share a port while the I/O receives it’s own ADSP-TS101S 6 Mbit of SRAM (Quad Ported??) User defined partitions Each block is accessed by one 128-bit bus
10 Multiplication and other Nifty Tricks ADSP MAC instructions (MRF and MRB) Various precision output (32, 40, or 80 bit) ADSP-TS101S Each compute block has it’s own set of MAC registers 8 16-bit MAC with 40-bit accumulation or 2 32-bit MAC with 80-bit accumulation Complex number MAC instructions 128-bit accelerator Trellis decoding (8 Trellis butterflies per cycle)
11 Data Address Generation ADSP 2 data address generation units (DAGS) 8 circular buffers per DAG ADSP-TS101S 2 data address generation units (IALU) 4 circular buffers per IALU Both support modulo arithmetic, bit reversal addressing, and post and pre-modify instructions
12 Ease of Use ADSP Easy to use Algebraic instruction set Visual DSP environment ADSP-TS101S Similar to but know have to consider 2 compute blocks ADI suggests leaving parallelization to their optimizing compiler Visual DSP environment
13 Specific DSP Algorithms and the TigerSHARC In ENEL515 (and/or related articles) we’ve studied the FIR, IIR, and FFT algorithms TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms.
14 FIR Filter Characteristics Think back (or forward, depending on how much you’ve procrastinated) to Lab #3. FIR Characteristics Simple, long loop Repetitive calculations (multiply, then add!) Access to an array of coefficients, and an array of “delay-line” values Few data dependency issues during the calculation of a single output For a filter of length N, require N multiplications and N adds to obtain a single output value.
15 TigerSHARC and the FIR Filter The general idea is: Divide and conquer! Take a filter of size N and split it into two groups of N/2 Utilize the TigerSHARC’s multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead) Two hardware loop counters to simultaneously control the two new “N/2” size FIR loops with no overhead! Can do all of the following SIMULTANEOUSLY! Fetch two operands (one coefficient, one delay line value) from two separate memory banks Fetch the next instruction Perform arithmetic operations on the PREVIOUS operands! Unlike SHARC, instruction/data clashes are non-existant due to the numerous bus paths linking computational units to memory space
16 TigerSHARC and the FIR Filter (continued….) 8-cycle-deep pipeline Stalls are expensive.. Branch Target Buffer reduces performance loss that results from branching in a deeply pipelined processor The long loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full Full pipeline means fast algorithm FIR Filter algorithms rely heavily on data sets that are aligned in memory Post-increment is your friend TigerSHARC Quad Data Accesses – Supply four aligned words to one compute block or two aligned words to each compute block.
17 Example Instructions X/Y Conditional Compute if xALE; do, R0=R1+R2 Condition codes, AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1. A = Adder, M = Multiplier, S = Shifter Memory Addessing Indirect post-modify with update, register offset: YR20=[J1+=J2] Indirect post-modify with update, 8-bit immediate offset: Q[K1+=0xF8]=XYR3:0 Indirect pre-modify no update, register offset: J3:2=L[K1+K2] Indirect pre-modify no update, immediate offset: YR3:2=L[K1+0x ] Complex Quad 16-bit Fixed Point Multiplication Instructions {X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})} {X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}
18 FIR Code Example
19 TigerSHARC and the IIR Filter Short, simple loop characteristic Means loop overhead is more of a concern Means keeping the pipeline full is tougher! Time to unroll the loop, although ADI says to let VisualDSP do it for you. Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously Idea: One computational block does feedforward calculations, one does feedback! Complex numbers commonly required Hardware support for complex MAC in TigerSHARC Again, Quad Data Access comes in handy for aligned data Post-increment is still your friend
20 TigerSHARC and the FFT Does not use the same MAC modes that IIR and FIR filters do. Requires more complicated addressing modes Example: Bit reverse addressing Found on both SHARC and TigerSHARC Difficult to split onto separate computational units and even more difficult to split amongst distributed processors Requires large arrays of complex variables and fixed coefficients Hardware complex number MAC comes in handy again! Large arrays of aligned data – Quad Data access again! Requires HIGH-PRECISION arithmetic Luckily we have 64-bit fixed point arithmetic and 40-bit extended floating point arithmetic. 80-bit MAC precision FFT Requires many intermediate values 32 GP registers in a single computational block
21
22
23 Conclusion TigerSHARC have a very SHARC-like architecture, except it’s MUCH more complex. Highly optimized for parallelism Major features: Complex number support, multiple computational units, high instruction throughput, wider buses. Performs DSP algorithms including FIR, IIR, FFT significantly faster than SHARC!
24 References ( ) 10. ADSP-2106x SHARC User’s Manual, Second Edition ( )
25 Note from Dr. Smith Information on Burg algorithm outside ICT536. It is essentially an FIR filter used for prediction (i.e. what FIR coefficients are needed so that the filtered signal is "white noise" )