1 Analog Devices TigerSHARC® DSP Family
Presented By: Mike Lee and Mike Demcoe
Date: April 8th, 2002

Presentation transcript:

2 TigerSHARC Architectural Overview
High-performance, 128-bit successor to the ADSP-2106x SHARC family
ADSP-TS101S, the newest TigerSHARC DSP, operates at 250 MHz!
Multiple computational units
  - Two compute blocks, each containing a register file, ALU, multiplier, and shifter
  - Two additional integer ALUs
  - Two hardware loop counter registers
Can execute up to four independent 32-bit instructions at a time
  - Or, eight 16-bit instructions
Very wide word widths for high-precision arithmetic
Designed to be used in a multiple-processor environment

3 TigerSHARC Architecture Overview (cont…)
BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline
  - 32-instruction, 4-way set-associative cache
  - User-controlled branch prediction
Three 128-bit-wide blocks of memory, which provide access to an instruction and two data operands without causing instruction/data conflicts
Load-store, Harvard architecture, like SHARC
Native support for complex number instructions

4 The TS101S Architecture

5 Details of Multiple Compute Blocks
Two computational units, each containing:
  - Register file – Multi-ported to allow multiple accesses to registers in a single clock cycle
      General-purpose registers!
      Contains 32 words, each word being 32 bits in length
  - ALU – Fixed-point and floating-point
  - Multiplier – Fixed-point and floating-point
      Also features MAC (multiply-and-accumulate) capabilities
  - Shifter – Standard logical and arithmetic shifts as well as bit manipulation

6 The TS101S Pipeline
Fetch stages: Fetch 1, Fetch 2, Fetch 3
IAB
Execution stages: Decode, Integer, Access, Execute 1, Execute 2

7 Pipelines and Instruction Related Information
ADSP
  - Three-stage pipeline
  - 20 ns instruction cycle
  - SISD, but can put instructions in parallel
ADSP-TS101S
  - Eight-stage pipeline with IAB
  - 4 ns instruction cycle
  - MIMD, and can also put instructions in parallel

8 Loops, Branching and Timers
ADSP
  - Zero-overhead hardware loop support
  - Delayed branching
  - One timer
ADSP-TS101S
  - Little support for zero-overhead hardware loops
  - 32-entry, 4-way associative BTB cache with branch prediction
  - Two timers

9 Memory and Buses
ADSP
  - 1 Mbit dual-ported SRAM
  - Shared by three buses (PM, DM, I/O)
  - PM and DM share a port while the I/O receives its own
ADSP-TS101S
  - 6 Mbit of SRAM (Quad Ported??)
  - User-defined partitions
  - Each block is accessed by one 128-bit bus

10 Multiplication and other Nifty Tricks
ADSP
  - MAC instructions (MRF and MRB)
  - Various precision output (32, 40, or 80 bit)
ADSP-TS101S
  - Each compute block has its own set of MAC registers
  - Eight 16-bit MACs with 40-bit accumulation or two 32-bit MACs with 80-bit accumulation
  - Complex number MAC instructions
  - 128-bit accelerator for Trellis decoding (8 Trellis butterflies per cycle)
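To make the complex MAC idea on this slide concrete, here is a minimal C sketch of one complex multiply-accumulate on 16-bit fixed-point parts with a wider accumulator. It is a software model only, not TigerSHARC code: the names cplx16_t, cplx_acc_t, and cmac16 are hypothetical, and int64_t merely stands in for the 40-bit hardware accumulator.

```c
#include <stdint.h>

/* Complex value with 16-bit fixed-point parts (hypothetical type name). */
typedef struct {
    int16_t re;
    int16_t im;
} cplx16_t;

/* Wide accumulator; int64_t is an assumption standing in for the
   40-bit accumulation mentioned on the slide. */
typedef struct {
    int64_t re;
    int64_t im;
} cplx_acc_t;

/* acc += a * b, where
   (a.re + j a.im)(b.re + j b.im)
     = (a.re*b.re - a.im*b.im) + j(a.re*b.im + a.im*b.re).
   Products are widened to 64 bits before they are combined, so the
   sum of two 32-bit products cannot overflow. */
static void cmac16(cplx_acc_t *acc, cplx16_t a, cplx16_t b)
{
    acc->re += (int64_t)a.re * b.re - (int64_t)a.im * b.im;
    acc->im += (int64_t)a.re * b.im + (int64_t)a.im * b.re;
}
```

Each call performs four real multiplies and two real add/subtracts, which is the work a hardware complex MAC folds into a single instruction.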

11 Data Address Generation
ADSP
  - 2 data address generation units (DAGs)
  - 8 circular buffers per DAG
ADSP-TS101S
  - 2 data address generation units (IALUs)
  - 4 circular buffers per IALU
Both support modulo arithmetic, bit-reversed addressing, and post- and pre-modify instructions
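The modulo (circular buffer) addressing with post-modify that both address generators provide can be modelled in C roughly as follows. This is a sketch under stated assumptions: circ_addr_t and circ_post_modify are hypothetical names, and the base/length/index fields only loosely mirror the registers a DAG or IALU would use.

```c
#include <stddef.h>

/* Hypothetical software model of one circular-buffer address register. */
typedef struct {
    size_t base;    /* index of the first element of the buffer      */
    size_t length;  /* number of elements in the buffer (must be > 0) */
    size_t index;   /* current position, base <= index < base+length */
} circ_addr_t;

/* Post-modify: return the current index, then advance it by `step`
   (which may be negative), wrapping around inside the buffer. */
static size_t circ_post_modify(circ_addr_t *a, long step)
{
    size_t current = a->index;
    long len = (long)a->length;
    long offset = (long)(a->index - a->base) + step;

    offset %= len;          /* C's % may leave a negative remainder */
    if (offset < 0) {
        offset += len;      /* fold it back into [0, len)           */
    }
    a->index = a->base + (size_t)offset;
    return current;
}
```

In hardware the wrap costs no extra instructions, which is why a delay line can be walked with a plain post-modify access.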

12 Ease of Use
ADSP
  - Easy to use
  - Algebraic instruction set
  - VisualDSP environment
ADSP-TS101S
  - Similar, but now you have to consider two compute blocks
  - ADI suggests leaving parallelization to their optimizing compiler
  - VisualDSP environment

13 Specific DSP Algorithms and the TigerSHARC
In ENEL515 (and/or related articles) we’ve studied the FIR, IIR, and FFT algorithms
TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms

14 FIR Filter Characteristics
Think back (or forward, depending on how much you’ve procrastinated) to Lab #3.
FIR Characteristics
  - Simple, long loop
  - Repetitive calculations (multiply, then add!)
  - Access to an array of coefficients, and an array of “delay-line” values
  - Few data dependency issues during the calculation of a single output
A filter of length N requires N multiplications and N adds to obtain a single output value.
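The characteristics above can be seen in a plain C version of the filter. This is a minimal reference sketch, not the Lab #3 code: the function names fir_output and fir_step are hypothetical, and the delay line is kept newest-sample-first and shifted in software rather than held in a circular buffer.

```c
#include <stddef.h>

/* One FIR output: the sum of coeff[k] * delay[k] over all N taps.
   This is the "multiply, then add" loop: N multiplies and N adds. */
static float fir_output(const float *coeff, const float *delay, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        acc += coeff[k] * delay[k];
    }
    return acc;
}

/* Push a new sample into the delay line (delay[0] is the newest
   value) and compute the corresponding output. */
static float fir_step(const float *coeff, float *delay, size_t n, float x)
{
    if (n == 0) {
        return 0.0f;
    }
    for (size_t k = n - 1; k > 0; --k) {
        delay[k] = delay[k - 1];   /* age every stored sample by one */
    }
    delay[0] = x;
    return fir_output(coeff, delay, n);
}
```

Feeding an impulse through fir_step returns the coefficients one per call, which is a quick sanity check for any FIR implementation.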

15 TigerSHARC and the FIR Filter
The general idea is: Divide and conquer!
Take a filter of size N and split it into two groups of N/2
  - Utilize the TigerSHARC’s multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead)
Two hardware loop counters to simultaneously control the two new “N/2” size FIR loops with no overhead!
Can do all of the following SIMULTANEOUSLY!
  - Fetch two operands (one coefficient, one delay line value) from two separate memory banks
  - Fetch the next instruction
  - Perform arithmetic operations on the PREVIOUS operands!
Unlike SHARC, instruction/data clashes are nonexistent due to the numerous bus paths linking computational units to memory space
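The divide-and-conquer split can be sketched in C with two independent accumulators, one per half of the taps. This is a sequential model written for illustration: the name fir_output_split is hypothetical, acc_x and acc_y only stand in for the two compute blocks, and on real hardware the two partial sums would run in parallel rather than inside one C loop.

```c
#include <stddef.h>

/* One FIR output computed as two half-length partial sums.
   acc_x models the work given to one compute block and acc_y the
   work given to the other; neither sum depends on the other, which
   is what allows them to run side by side. */
static float fir_output_split(const float *coeff, const float *delay,
                              size_t n)
{
    size_t half = n / 2;
    float acc_x = 0.0f;   /* taps 0 .. half-1      */
    float acc_y = 0.0f;   /* taps half .. 2*half-1 */

    for (size_t k = 0; k < half; ++k) {
        acc_x += coeff[k] * delay[k];
        acc_y += coeff[half + k] * delay[half + k];
    }
    if (n % 2 != 0) {
        /* Odd N: one leftover tap, handled outside the paired loop. */
        acc_y += coeff[n - 1] * delay[n - 1];
    }
    return acc_x + acc_y;   /* final combine is the "some overhead" */
}
```

The loop body runs N/2 times instead of N, and the single add at the end is the extra cost of merging the two halves.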

16 TigerSHARC and the FIR Filter (continued…)
8-cycle-deep pipeline
  - Stalls are expensive
  - Branch Target Buffer reduces the performance loss that results from branching in a deeply pipelined processor
The long-loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full
  - Full pipeline means fast algorithm
FIR filter algorithms rely heavily on data sets that are aligned in memory
  - Post-increment is your friend
  - TigerSHARC Quad Data Accesses – Supply four aligned words to one compute block or two aligned words to each compute block

17 Example Instructions
X/Y Conditional Compute
  if xALE; do, R0=R1+R2
  Condition codes: AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1
  A = Adder, M = Multiplier, S = Shifter
Memory Addressing
  Indirect post-modify with update, register offset: YR20=[J1+=J2]
  Indirect post-modify with update, 8-bit immediate offset: Q[K1+=0xF8]=XYR3:0
  Indirect pre-modify no update, register offset: J3:2=L[K1+K2]
  Indirect pre-modify no update, immediate offset: YR3:2=L[K1+0x ]
Complex Quad 16-bit Fixed Point Multiplication Instructions
  {X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})}
  {X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}

18 FIR Code Example

19 TigerSHARC and the IIR Filter
Short, simple loop characteristic
  - Means loop overhead is more of a concern
  - Means keeping the pipeline full is tougher! Time to unroll the loop, although ADI says to let VisualDSP do it for you.
Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously
  - Idea: One computational block does feedforward calculations, one does feedback!
Complex numbers commonly required
  - Hardware support for complex MAC in TigerSHARC
Again, Quad Data Access comes in handy for aligned data
Post-increment is still your friend
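The feedforward/feedback split can be illustrated with a direct-form I IIR step in C, where the two sums are kept apart. This is only a sketch: iir_step is a hypothetical name, the sign convention y[n] = sum(b[k] x[n-k]) - sum(a[k] y[n-1-k]) is an assumption, and the two sums are computed one after the other here rather than on two compute blocks.

```c
#include <stddef.h>

/* One direct-form I IIR step.
   b[0..nb-1] are the feedforward coefficients, x_hist[0..nb-1] holds
   the input history (newest first). a[0..na-1] are the feedback
   coefficients, y_hist[0..na-1] holds past outputs (newest first),
   so a[k] multiplies y[n-1-k]. */
static float iir_step(const float *b, float *x_hist, size_t nb,
                      const float *a, float *y_hist, size_t na,
                      float x)
{
    /* Shift the new input into the input history. */
    for (size_t k = nb; k-- > 1; ) {
        x_hist[k] = x_hist[k - 1];
    }
    if (nb > 0) {
        x_hist[0] = x;
    }

    /* Feedforward sum: the work one compute block could take. */
    float ff = 0.0f;
    for (size_t k = 0; k < nb; ++k) {
        ff += b[k] * x_hist[k];
    }

    /* Feedback sum: uses only past outputs, so it does not depend
       on ff and could be computed by the other block. */
    float fb = 0.0f;
    for (size_t k = 0; k < na; ++k) {
        fb += a[k] * y_hist[k];
    }

    float y = ff - fb;

    /* Shift the new output into the output history. */
    for (size_t k = na; k-- > 1; ) {
        y_hist[k] = y_hist[k - 1];
    }
    if (na > 0) {
        y_hist[0] = y;
    }
    return y;
}
```

Because each output feeds the next one, the loop over samples stays serial; only the two sums within a single step are independent.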

20 TigerSHARC and the FFT
Does not use the same MAC modes that IIR and FIR filters do
Requires more complicated addressing modes
  - Example: bit-reversed addressing, found on both SHARC and TigerSHARC
Difficult to split onto separate computational units and even more difficult to split amongst distributed processors
Requires large arrays of complex variables and fixed coefficients
  - Hardware complex number MAC comes in handy again!
  - Large arrays of aligned data – Quad Data Access again!
Requires HIGH-PRECISION arithmetic
  - Luckily we have 64-bit fixed-point arithmetic and 40-bit extended floating-point arithmetic
  - 80-bit MAC precision
FFT requires many intermediate values
  - 32 GP registers in a single computational block
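Bit-reversed addressing, which the hardware does for free, looks like this when written out in C. This is a minimal software sketch for a radix-2 FFT of length 2^bits; bit_reverse and bit_reverse_permute are hypothetical names, and the separate re/im arrays are an assumption about data layout.

```c
/* Reverse the lowest `bits` bits of `index`
   (for example, with bits = 3: 001 -> 100, 011 -> 110). */
static unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned i = 0; i < bits; ++i) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Reorder a complex array of length 2^bits into bit-reversed order,
   in place. Each pair is swapped once (only when j > i). */
static void bit_reverse_permute(float *re, float *im, unsigned bits)
{
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; ++i) {
        unsigned j = bit_reverse(i, bits);
        if (j > i) {
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i];       im[i] = im[j]; im[j] = t;
        }
    }
}
```

For an 8-point transform the index order becomes 0, 4, 2, 6, 1, 5, 3, 7. With hardware bit-reversed addressing this reordering happens as part of the memory access instead of as a separate pass.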

23 Conclusion
TigerSHARC has a very SHARC-like architecture, except it’s MUCH more complex.
  - Highly optimized for parallelism
Major features: complex number support, multiple computational units, high instruction throughput, wider buses.
Performs DSP algorithms including FIR, IIR, and FFT significantly faster than SHARC!

24 References
10. ADSP-2106x SHARC User’s Manual, Second Edition

25 Note from Dr. Smith
Information on Burg algorithm outside ICT536. It is essentially an FIR filter used for prediction (i.e. what FIR coefficients are needed so that the filtered signal is "white noise").