Processor Comparison. M. R. Smith, smithmr@ucalgary.ca, 07/16/96

Presentation transcript:

Processor Comparison
M. R. Smith, smithmr@ucalgary.ca
9/19/2019

TigerSHARC Processor Architecture

SHARC
[Block diagram: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), cache memory (32 x 48), program sequencer, PMA bus (24), DMA bus (32), PMD bus (48), DMD bus (40), bus connect, JTAG test & emulation, flags, timer, register file (16 x 40), floating- and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU]

Comparison: Data registers
TigerSHARC ADSP-TS201
    Data registers: 32 XR (float or integer), 32 YR (float or integer)
    Always in MIMD mode
SHARC ADSP-21XXX
    Data registers: 16 R (float or integer)
    Can be switched to SIMD, which adds 16 S (float or integer)

Comparison: Address registers
TigerSHARC ADSP-TS201
    Address registers: 32 J (J-bus), 32 K (K-bus)
    4 circular buffer (CB) operations
    Instruction line:
        R6 = [J1 += J4]; R8 = [K1 += K4];;
SHARC ADSP-21XXX
    Address registers: I0 to I7 (dm bus) + M0 to M7, I8 to I15 (pm bus) + M8 to M15
    16 circular buffer (CB) operations
    Instruction:
        ActivateSISD;
        R6 = dm(I1, M4), R8 = pm(I9, M13);
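Both instruction forms above read memory at the address register and then add the modify register to it (post-modify), and a circular buffer additionally wraps the updated address back into the buffer. The following is a minimal Python sketch of that behaviour, not processor code. The function name post_modify_read and the base/length parameters are hypothetical names chosen for illustration, and the modulo is a stand-in for the hardware's single add-or-subtract of the buffer length.

```python
def post_modify_read(memory, index, modify, base=None, length=None):
    """Read memory[index], then update the index by `modify`.

    If `base` and `length` are given, the updated index is wrapped into
    the circular buffer [base, base + length). Returns (value, new_index).
    """
    value = memory[index]          # access uses the index BEFORE the update
    new_index = index + modify     # post-modify, like [J1 += J4] or dm(I1, M4)
    if base is not None and length is not None:
        # Circular buffer wrap. Python's % is non-negative for a positive
        # length, so this also handles negative modify values.
        new_index = base + (new_index - base) % length
    return value, new_index


if __name__ == "__main__":
    samples = [10, 20, 30, 40]
    index = 0
    for _ in range(6):
        value, index = post_modify_read(samples, index, 3, base=0, length=4)
        print(value, index)
```

Passing no base and length gives plain post-modify addressing. Passing them gives the circular buffer case.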

Comparison: Compute pipeline
TigerSHARC
    10 stage pipeline: 4 instruction, 4 J-IALU + 2 compute ALU stages
    Consequences:
        R2 = [J2 += J12];;
            Stall
        R3 = R2 * R1;
            Stall
        R5 = R4 + R3;
SHARC
    3 stage pipeline: 1 instruction, 1 memory + 1 compute ALU
    Consequences:
        R2 = dm(I2, M2);
            NO stall
        R3 = R2 * R1;
            NO stall
        R5 = R4 + R3;
    Has other consequences too
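The stall pattern above can be reproduced with a very small in-order issue model: an instruction cannot issue until the registers it reads are ready. This is only a sketch. The latency values used below are assumptions chosen to reproduce one stall after the load and one after the multiply; they are not taken from either processor's documentation. The name count_stalls is hypothetical.

```python
def count_stalls(program, latency):
    """Count stall cycles for an in-order, one-issue-per-cycle pipeline.

    program: list of (dest_register, source_registers, kind)
    latency: dict mapping kind -> cycles until the result can be used
             (these values are assumptions, not datasheet numbers)
    """
    ready = {}      # register -> first cycle in which its value is usable
    cycle = 0       # cycle in which the next instruction would issue
    stalls = 0
    for dest, sources, kind in program:
        earliest = max([ready.get(s, 0) for s in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # wait for the operand
            cycle = earliest
        ready[dest] = cycle + latency[kind]
        cycle += 1
    return stalls


# The three dependent instructions from the slide
program = [
    ("R2", ["J2", "J12"], "load"),   # R2 = [J2 += J12]
    ("R3", ["R2", "R1"], "mul"),     # R3 = R2 * R1
    ("R5", ["R4", "R3"], "add"),     # R5 = R4 + R3
]

if __name__ == "__main__":
    short_pipe = {"load": 1, "mul": 1, "add": 1}   # assumed: result ready next cycle
    long_pipe = {"load": 2, "mul": 2, "add": 1}    # assumed: one extra cycle each
    print(count_stalls(program, short_pipe))
    print(count_stalls(program, long_pipe))
```

With every latency set to 1 the model reports no stalls, as on the SHARC side of the slide. The longer assumed latencies give one stall before each dependent instruction, as on the TigerSHARC side.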

Comparison: Memory access
TigerSHARC
    6 memory blocks
    Blocks accessed by 3 busses: J-bus, K-bus, instruction bus
    Avoids data-instruction fetch clashes
    Avoids data-data fetch clashes
SHARC
    2 memory blocks
    Blocks accessed by 2 busses: dm bus, pm bus (instruction)
    Avoids data-instruction fetch clashes
    How does this architecture avoid data-data fetch clashes?

SHARC: 3 memory blocks
    Most instructions use 1 instruction fetch plus only 0 or 1 data fetch, so 2 busses are sufficient
    A separate (small) instruction cache fills with the instructions that need 1 instruction fetch + 2 data fetches
[Block diagram: SHARC architecture with DAG 1, DAG 2, cache memory (32 x 48), program sequencer, PM and DM address and data busses, register file (16 x 40), multiplier, barrel shifter and ALU]

Dual data accesses on SHARC
    R2 = dm(I2, M2), R3 = pm(I9, M9);
    R4 = R2 + R3;
Instruction fetch clashes with data fetch of previous instruction, so put the instruction into the instruction cache

Dual data accesses on SHARC
    Start loop:
        R2 = dm(I2, M2), R3 = pm(I9, M9);
        R4 = R2 + R3;
    End_loop:
First time round the loop, the instruction fetch clashes with the data fetch of the previous instruction, so the instruction is put into the instruction cache.
BUT second time round the loop the instruction is in the instruction cache, so 3 busses are available and there is no stall.
Consequence: if the loop is large (e.g. the Viterbi algorithm) it may contain many dual data accesses. A new dual-access instruction placed into the instruction cache then causes an old one to be thrown out. Next time around the loop, the old instruction is put back into the cache and the new one is thrown out. Cache thrash occurs and no speed savings are gained.
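The thrash effect can be illustrated with a toy cache model. The sketch below assumes a fully associative cache with least-recently-used replacement, holding 32 entries (the 32 in the block diagram). The real cache organisation and replacement policy may differ, so the numbers only show the trend. The name count_cache_misses and the address values are hypothetical.

```python
from collections import OrderedDict


def count_cache_misses(loop_addresses, iterations, cache_entries):
    """Count instruction-cache misses for a loop of dual-access instructions.

    loop_addresses: addresses of the instructions that need caching,
                    in the order they execute inside the loop
    Assumes a fully associative LRU cache (an assumption, not the
    documented SHARC cache organisation).
    """
    cache = OrderedDict()   # oldest entry first
    misses = 0
    for _ in range(iterations):
        for address in loop_addresses:
            if address in cache:
                cache.move_to_end(address)      # hit: mark as recently used
            else:
                misses += 1                     # miss: the stall is paid again
                cache[address] = True
                if len(cache) > cache_entries:
                    cache.popitem(last=False)   # throw out the oldest entry
    return misses


if __name__ == "__main__":
    small_loop = list(range(4))     # 4 dual-access instructions
    large_loop = list(range(33))    # one more than the cache can hold
    print(count_cache_misses(small_loop, 10, 32))
    print(count_cache_misses(large_loop, 10, 32))
```

In this model the small loop misses only on the first pass. The large loop misses on every dual-access instruction in every pass, which is the cache thrash described above.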

Hardware loops
TigerSHARC
    2 hardware loops available
SHARC
    6 hardware loops available
    Not as useful as it sounds
    Only the “inner loop” is executed often, so that is the only one where loop overhead is really important
    Can’t have loops ending on the same instruction, so need to add nops

Jumps and pipeline
TigerSHARC
    10 stage pipeline
    Non-predicted branch causes many instruction fetches and execution stages to be thrown away
    Partially solved by having the ability to choose between “predicted” and “non-predicted” branches
    Heavy penalty when the “other choice” is taken
    Can’t use delayed branch concept as instructions are always discarded
    Most instructions are made conditional
SHARC
    3 stage pipeline
    Non-predicted branch causes 2 instruction fetches + 1 execution to be thrown away
    Use delayed branch concept
    Two instructions after the branch are ALWAYS executed
    If you can’t find useful instructions to put in the “delay slots” then put NOPs
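The delayed branch rule (the two instructions after the branch are always executed) can be sketched as a tiny interpreter. This is an illustration of the concept only, not a model of the SHARC sequencer. The program encoding and the name run_delayed_branch are hypothetical, and a jump placed inside a delay slot is simply ignored here.

```python
def run_delayed_branch(program, delay_slots=2, max_steps=100):
    """Execute a list of (name, target) pairs with delayed branches.

    An entry named "jump" transfers control to `target`, but only after
    the next `delay_slots` instructions have executed. Returns the names
    of the executed instructions in order.
    """
    trace = []
    pc = 0
    pending = None      # (delay slots still to execute, branch target)
    while pc < len(program) and len(trace) < max_steps:
        name, target = program[pc]
        trace.append(name)
        next_pc = pc + 1
        if pending is not None:
            remaining, branch_target = pending
            remaining -= 1
            if remaining == 0:
                next_pc = branch_target     # delay slots done: take the branch
                pending = None
            else:
                pending = (remaining, branch_target)
        elif name == "jump":
            pending = (delay_slots, target)
        pc = next_pc
    return trace


if __name__ == "__main__":
    program = [
        ("jump", 4),        # delayed jump to index 4
        ("slot_a", None),   # delay slot 1: always executed
        ("slot_b", None),   # delay slot 2: always executed
        ("skipped", None),  # never reached
        ("target", None),
    ]
    print(run_delayed_branch(program))
```

If no useful work fits in the two slots, they would hold NOPs, as the slide says.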

BDTI: good source of info
www.bdti.com/bdtimark/chip_float_scores.pdf

BDTI: good source of info
www.bdti.com/bdtimark/
