Processor Comparison
M. R. Smith
smithmr@ucalgary.ca
9/19/2019
TigerSHARC Processor Architecture
SHARC
[Figure: SHARC block diagram. Blocks shown: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), cache memory (32 x 48), program sequencer, PMA, PMD, DMA and DMD buses (width labels 24, 32, 48, 40), bus connect, JTAG test & emulation, flags, timer, register file (16 x 40), floating- and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU.]
Comparison -- Data registers
TigerSHARC ADSP-201
  Data registers
    32 XR (float or integer)
    32 YR (float or integer)
  Always in MIMD mode
SHARC ADSP-21XXX
  Data registers
    16 R (float or integer)
  Can be switched to SIMD
    16 S (float or integer)
Comparison – Address registers
TigerSHARC ADSP-201
  Address registers
    32 J (J-Bus)
    32 K (K-Bus)
  4 CB (circular buffer) operations
  Instruction line
    R6 = [J1 += J4]; R8 = [K1 += K4];;
SHARC ADSP-21XXX
  Address registers
    I0 – I7 (dm bus) + M0 – M7
    I8 – I15 (pm bus) + M8 – M15
  16 CB (circular buffer) operations
  Instruction
    ActivateSISD;
    R6 = dm(I1, M4), R8 = pm(I9, M13);
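The loads on this slide (`R6 = [J1 += J4]` on TigerSHARC, `R6 = dm(I1, M4)` on SHARC) are post-modify accesses: memory is read at the current index, and the modifier is then added to the index. A minimal Python sketch of that behaviour, with an optional circular-buffer wrap, might look as follows. The function name `post_modify_read` and the `base` and `length` parameters are illustrative names chosen for this sketch, not identifiers from the slide, and the memory contents are made up.

```python
def post_modify_read(memory, index, modifier, base=None, length=None):
    """Return (value, new_index): read at index, then index += modifier.

    If base and length are given, the new index wraps inside
    [base, base + length), as a circular buffer would.
    """
    value = memory[index]
    new_index = index + modifier
    if base is not None and length is not None:
        new_index = base + (new_index - base) % length
    return value, new_index


memory = list(range(100, 116))   # 16 words of made-up data memory
i1, m4 = 0, 4                    # stand-ins for I1 (or J1) and M4 (or J4)

# Linear access, in the spirit of R6 = dm(I1, M4);
r6, i1 = post_modify_read(memory, i1, m4)

# Circular access over an 8-word buffer that starts at address 8
idx = 14
values = []
for _ in range(4):
    v, idx = post_modify_read(memory, idx, 3, base=8, length=8)
    values.append(v)
```

After the linear access, `r6` holds the word at address 0 and `i1` has advanced by the modifier. In the circular case the index steps by 3 and wraps back into the range 8 to 15 instead of running past the end of the buffer.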
Comparison – Compute Pipeline
TigerSHARC
  10 stage pipeline
  4 instruction, 4 J-IALU + 2 compute ALU stages
  Consequences
    R2 = [J2 += J12];;
      Stall
    R3 = R2 * R1;
      Stall
    R5 = R4 + R3;
SHARC
  3 stage pipeline
  1 instruction, 1 memory + 1 compute ALU
  Consequences
    R2 = dm(I2, M2);
      NO stall
    R3 = R2 * R1;
      NO stall
    R5 = R4 + R3;
  Has other consequences too
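The stall pattern on this slide comes from each instruction needing the result of the one before it. A rough way to see the effect is a toy in-order model in which a result becomes readable a fixed number of cycles after issue. The sketch below is only that: the function `count_stalls` is a hypothetical helper, and the latency of 2 used for the deeper pipeline is an assumed value for illustration, not a figure from the slide or from any data sheet.

```python
def count_stalls(program, latency):
    """Count stall cycles for a toy in-order, single-issue pipeline.

    program: list of (dest, sources) register-name tuples.
    latency: cycles after issue before a result can be read
             (1 means the very next instruction can use it).
    """
    ready_at = {}      # register -> first cycle at which it is readable
    cycle = 0
    stalls = 0
    for dest, sources in program:
        earliest = max((ready_at.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # wait for the operand
            cycle = earliest
        ready_at[dest] = cycle + latency
        cycle += 1
    return stalls


# The three-instruction dependency chain from the slide
program = [
    ("R2", ["ADDR"]),        # R2 = load from memory
    ("R3", ["R2", "R1"]),    # R3 = R2 * R1
    ("R5", ["R4", "R3"]),    # R5 = R4 + R3
]

short_pipeline_stalls = count_stalls(program, latency=1)
deep_pipeline_stalls = count_stalls(program, latency=2)   # assumed latency
```

With a latency of 1 the chain runs without stalls, matching the "NO stall" column. With any larger latency each dependent instruction waits, which is the behaviour the TigerSHARC column shows. The real stall lengths depend on the instruction types involved.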
Comparison – Memory access
TigerSHARC
  6 memory blocks
  Blocks accessed by 3 buses
    J – Bus
    K – Bus
    Instruction Bus
  Avoids data – instruction fetch clashes
  Avoids data – data fetch clashes
SHARC
  2 memory blocks
  Blocks accessed by 2 buses
    Dm – bus
    Pm – bus (instruction)
  Avoids data – instruction fetch clashes
  How does this architecture avoid data – data fetch clashes?
SHARC -- 3 memory blocks
  Most instructions use 1 instruction fetch plus only 0 or 1 data fetch, so 2 buses are sufficient
  A separate (small) instruction cache fills with the instructions that need 1 instruction fetch + 2 data fetches
[Figure: SHARC block diagram, repeated from the earlier SHARC slide, showing the 32 x 48 cache memory beside the program sequencer.]
Dual data accesses on SHARC
    R2 = dm(I2, M2), R3 = pm(I9, M9);
    R4 = R2 + R3;
  Instruction fetch clashes with data fetch of previous instruction, so the instruction is put into the instruction cache
Dual data accesses on SHARC
Start loop:
    R2 = dm(I2, M2), R3 = pm(I9, M9);
    R4 = R2 + R3;
End_loop:
  First time round the loop, the instruction fetch clashes with the data fetch of the previous instruction, so the instruction is put into the instruction cache
  BUT second time round the loop the instruction is already in the instruction cache, so 3 buses are available and there is no stall
  Consequence: if the loop is large (e.g. the Viterbi algorithm), it may contain many dual data accesses. A new dual-access instruction placed into the instruction cache then causes an old one to be thrown out. Next time around the loop, the old instruction is put back into the cache and the new one is thrown out. Cache thrash occurs and no speed savings are gained
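The thrashing described on this slide can be illustrated with a small simulation. The sketch below assumes a fully associative cache with least-recently-used replacement, which is a simplification made for illustration; the real cache organisation and replacement policy may differ. The capacity of 32 entries is taken from the "32 x 48" label on the block diagram. The helper names `cache_hits` and `loop_trace` are invented for this example.

```python
from collections import OrderedDict


def cache_hits(addresses, capacity):
    """Count hits for a fully associative LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)          # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[addr] = True
    return hits


def loop_trace(n_conflicting, iterations):
    """Addresses of the conflicting instructions, repeated every iteration."""
    return list(range(n_conflicting)) * iterations


CAPACITY = 32   # entries, from the 32 x 48 label on the block diagram

# Small loop: 8 conflicting instructions, 10 trips round the loop
small_loop_hits = cache_hits(loop_trace(8, 10), CAPACITY)

# Large loop: one more conflicting instruction than the cache can hold
large_loop_hits = cache_hits(loop_trace(33, 10), CAPACITY)
```

In the small loop every conflicting instruction misses once and then hits on each later iteration. In the large loop, under the LRU assumption, each new entry evicts exactly the entry that will be needed next, so the cache never produces a hit.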
Hardware loops
TigerSHARC
  2 hardware loops available
SHARC
  6 hardware loops available
  Not as useful as it sounds
    Only the “inner loop” is executed often, so that is the only one where loop overhead is really important
    Can’t have loops ending on the same instruction, so nops need to be added
Jumps and pipeline
TigerSHARC
  10 stage pipeline
  Non-predicted branch causes many instruction fetches and execution stages to be thrown away
  Partially solved by the ability to choose between “predicted” and “non-predicted” branches
    Heavy penalty when the “other choice” is taken
  Can’t use delayed branch concept as instructions are always discarded
  Most instructions are made conditional
SHARC
  3 stage pipeline
  Non-predicted branch causes 2 instruction fetches + 1 execution to be thrown away
  Use delayed branch concept
    Two instructions after the branch are ALWAYS executed
    If you can’t find useful instructions to put in the “delay slots” then put NOPs
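The delayed branch idea on the SHARC side can be sketched with a toy interpreter. The code below models only which instructions execute, not how many cycles they take, and it treats every jump as taken. The function `run` and the instruction labels are made up for this illustration.

```python
def run(program, delayed):
    """Execute a toy program and return the labels of executed instructions.

    program: list of (label, jump_target) pairs; jump_target is None for an
             ordinary instruction or a list index for an always-taken jump.
    delayed: if True, the two instructions after a jump execute before
             control transfers (two delay slots).
    """
    executed = []
    pc = 0
    pending = None     # (target, slots_left) while a delayed jump is in flight
    while pc < len(program):
        label, target = program[pc]
        executed.append(label)
        pc += 1
        if pending is not None:
            tgt, slots = pending
            slots -= 1
            if slots == 0:
                pc = tgt                 # delay slots used up: jump now
                pending = None
            else:
                pending = (tgt, slots)
        elif target is not None:
            if delayed:
                pending = (target, 2)    # jump after two more instructions
            else:
                pc = target              # jump immediately


    return executed


program = [
    ("jump", 4),
    ("slot1", None),
    ("slot2", None),
    ("skipped", None),
    ("target", None),
]

plain = run(program, delayed=False)
with_slots = run(program, delayed=True)
```

In the plain case the two instructions after the jump never execute, so on real hardware their fetches are wasted. In the delayed case they run before the jump takes effect, which is why useful work (or NOPs) must be placed in those two slots.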
BDTI – good source of info
  www.bdti.com/bdtimark/chip_float_scores.pdf
BDTI – good source of info
  www.bdti.com/bdtimark/