Systematic development of programs with parallel instructions -- SHARC ADSP2106X processor. M. Smith, Electrical and Computer Engineering, University of Calgary, Canada

Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary, Canada ucalgary.ca

6/2/2015 ENCM Systematic development of parallel instructions on SHARC ADSP21061 Copyright 2 / 37

To be tackled today
- What’s the problem?
- Standard code development of “C”-code
- Process for “Code with parallel instruction”
    - Rewrite with specialized resources
    - Move to “resource chart”
    - Unroll the loop
    - Adjust code
    - Reroll the loop
- Check if worth the effort

ADSP-2106x -- Parallelism opportunities
- Ability for parallel memory operations: one each on the pm, dm and instruction cache busses
- Memory pointer operations
    - Post modify 2 index registers
    - Automatic circular buffer operations
    - Automatic bit reverse addressing
- Many parallel operations and register to register bus transfers
    - Rn = Rx + Ry or Rn = Rx * Ry
    - Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
- Zero overhead loops
- Instruction pipeline issues
- Key issue -- only 48? bits available in the OPCODE to describe
    - 16 data registers in 3 destinations and 6 sources = 135 bits
    - 2 * (8 index + 8 modify + 16 data) = 64 bits
    - Condition code selection, 32 bit constants etc.

Basic code development -- any system
- Write the “C” code for the function void Convert(float *temperature, int N), which converts an array of temperatures measured in Celsius (Canadian market) to Fahrenheit (American market)
- Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices

Parallel instruction code development
- Write the 21k assembly code for the function void Convert(float *temperature, int N) which etc...
- Determine the instruction flow through the architecture using a resource usage diagram
- Theoretically optimize the code -- a 2 minute counting process
- Compare and contrast the amount of time to perform the subroutine before and after customization

Standard “C” code

    void Convert(float *temperature, int N) {
        int count;
        for (count = 0; count < N; count++) {
            *temperature = (*temperature) * 9 / 5 + 32;
            temperature++;
        }
    }

Standard warning -- what does an optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
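The warning about 9 / 5 can be checked directly in C. The following is a minimal sketch, not taken from the slides; the three function names are hypothetical and exist only to contrast the ways of writing the scale factor.

```c
/* Hypothetical helpers (not from the slides) contrasting three ways
   of writing the Celsius-to-Fahrenheit scale factor. */

/* (9 / 5) is an integer constant expression: it truncates to 1. */
float scale_int_division(float celsius)
{
    return celsius * (9 / 5) + 32;
}

/* Floating-point constants keep the intended factor of 1.8. */
float scale_float_division(float celsius)
{
    return celsius * (9.0f / 5.0f) + 32.0f;
}

/* Without parentheses, evaluation is left to right: celsius * 9 is
   already a float, so the division by 5 is a float division. */
float scale_left_to_right(float celsius)
{
    return celsius * 9 / 5 + 32;
}
```

For an input of 100.0 the first version returns 132.0 while the other two return 212.0 (to within float rounding), so the parenthesized integer form is the one to watch for.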

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
    - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop

Load/store style “C” code

    void Convert(register float *temperature, register int N) {
        register int count;
        register float *pt = temperature;
        register float scratch;
        for (count = 0; count < N; count++) {
            scratch = *pt;
            scratch = scratch * (9.0f / 5.0f);   /* not (9 / 5): integer division gives 1 */
            scratch = scratch + 32;
            *pt = scratch;
            pt++;
        }
    }

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
    - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
    - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques
- Compare and contrast time -- setup and loop

Straight conversion -- PROLOGUE

    // void Convert( reg float *temperature, reg int N ) {
    .segment/pm seg_pmco;
    .global _Convert;
    _Convert:
        // register int count = GARBAGE;
    #define countR1 scratchR1
        // register float *pt = temperature;
    #define pt scratchDMpt
        pt = INPAR1;
        // float scratch = GARBAGE;
    #define scratchF2 F2
        // For the CURRENT code -- no volatile
        // registers are needed -- may not remain true

Straight conversion of code

    // for (count = 0; count < N; count++) {
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        // scratch = *pt;
        scratchF2 = dm(0, pt);               // Not ++ as pt re-used
        // scratch = scratch * (9 / 5);
        // INPAR1 (R4) is dead -- can reuse as F4
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;                    // No division, use register constant
        scratchF2 = scratchF2 * constantF4;
        // scratch = scratch + 32;
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;                        // NOT F0 = 32, which does not give the float 32.0
        scratchF2 = scratchF2 + F0_32;
        // *pt = scratch; pt++;
        dm(pt, 1) = scratchF2;
    LOOP_END:

5 magic lines of code

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
    - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
    - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques (attempt to)
- Compare and contrast time -- setup and loop

Speed rules
- IF you want adds and multiplies to occur on the same line
    F1 = F2 * F3, F4 = F5 + F6;
    - Want to do as a single instruction
    - Not enough bits in the opcode
        - Register description (bits)
        - Plus bits for describing math operations, conditions and memory ops?
- Allowed form
    Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7),
    Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    - Must rearrange register usage within the program code for this to be possible
    - Register description (bits) -- other bits “understood”
- Inconvenient rather than limiting, e.g.
    F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
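The source-register rule above is easy to check mechanically before writing the parallel instruction. The sketch below is an illustration only: `parallel_mul_add_ok` and `in_range` are hypothetical helper names, and the ranges are simply the ones quoted on the slide.

```c
#include <stdbool.h>

/* Hypothetical helper: true when register number reg lies in [lo, hi]. */
static bool in_range(int reg, int lo, int hi)
{
    return reg >= lo && reg <= hi;
}

/* Checks the source-register rule quoted on the slide for a parallel
   multiply/add:
       Fn = F(0..3)  * F(4..7),
       Fm = F(8..11) + F(12..15)
   Arguments are the register numbers of the four source operands. */
bool parallel_mul_add_ok(int mul_x, int mul_y, int add_x, int add_y)
{
    return in_range(mul_x, 0, 3)  && in_range(mul_y, 4, 7)
        && in_range(add_x, 8, 11) && in_range(add_y, 12, 15);
}
```

With this check, F0 * F4 paired with F8 + F12 passes, while F4 * F0 (sources swapped) or the original F2 * F4 paired with F2 + F0 does not, which is why the registers are re-assigned on the following slides.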

When should we worry about the register assignment?

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);          // Not ++ as to be re-used
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;
        dm(pt, plus1DM) = scratchF2;
    LOOP_END:

Check on required register use

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);          // Special requirements here on F2 -- becomes source later??
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;  // Fn = F(0,1,2 or 3) * F(4,5,6 or 7),
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;
        scratchF2 = scratchF2 + F0_32;       // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Register re-assignment -- Step 1

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2                     // OKAY
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float -- OKAY
        constantF4 = 1.8;
        scratchF2 = scratchF2 * constantF4;  // SOURCES okay here: Fn = F(0,1,2 or 3) * F(4,5,6 or 7),
    #define F0_32 F0                         // Must be float
        F0_32 = 32.0;                        // WRONG to use F0 here -- ADDITION
        scratchF2 = scratchF2 + F0_32;       // WRONG to use F2 as DEST early: Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;         // OKAY

Register re-assignment -- Step 2

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    #define scratchF8 F8                     // Multiply result; define added so the listing is complete
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;
        scratchF8 = scratchF2 * constantF4;  // Answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                       // INPAR3 is available
        F12_32 = 32.0;
        scratchF2 = scratchF8 + F12_32;      // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;

Fix poor coding practice -- “C” or assembly

    #define count scratchR1
    #define pt scratchDMpt
    #define scratchF2 F2
    #define scratchF8 F8                     // Multiply result; define added so the listing is complete
        LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
        scratchF2 = dm(pt, zeroDM);
        // INPAR1 (R4) is dead -- can reuse
    #define constantF4 F4                    // Must be float
        constantF4 = 1.8;                    // MOVE OUTSIDE LOOP
        scratchF8 = scratchF2 * constantF4;  // Answer must be in F(8, 9, 10 or 11)
    #define F12_32 F12                       // INPAR3 is available
        F12_32 = 32.0;                       // MOVE OUTSIDE LOOP
        scratchF2 = scratchF8 + F12_32;      // Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
        dm(pt, plus1DM) = scratchF2;
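The same fix applies in “C”: the two constants are loop invariants and can be set up once before the loop. The sketch below is an illustrative C counterpart of the register-reassigned assembly, not code from the slides; the names `convert_hoisted`, `scale`, `offset` and `product` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Load/store style conversion with the loop-invariant constants
   hoisted out of the loop. The separate 'product' variable plays the
   role of the F8 multiply result on the slide. */
void convert_hoisted(float *temperature, int n)
{
    assert(n <= 0 || temperature != NULL);

    const float scale = 1.8f;     /* set once, outside the loop */
    const float offset = 32.0f;   /* set once, outside the loop */
    float *pt = temperature;

    for (int count = 0; count < n; count++) {
        float loaded = *pt;               /* read     */
        float product = loaded * scale;   /* multiply */
        float result = product + offset;  /* add      */
        *pt++ = result;                   /* write, then post-increment */
    }
}
```

Only the four operations that must happen per point (read, multiply, add, write) remain inside the loop, which is the form the resource usage chart works from.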

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
    - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
    - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize using techniques (attempt to)
- Compare and contrast time -- setup and loop

Resource Management -- Chart 1 -- Basic code
[Resource usage chart for the basic loop, closed by LOOP_END: (DO LOOP_END - 1 UNTIL LCE)]
- In theory -- if we could find out how to do *, + and dm in parallel
    - DATA-BUS (dm) is the limiting resource
    - 2 cycle loop possible
- Before proceeding -- Is a 2 cycle loop needed? Is a 2 cycle loop enough?
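The “2 cycle loop possible” figure follows from counting how often each resource is used per pass. A minimal sketch of that counting is given below, assuming each resource (dm bus, multiplier, adder) can start at most one operation per cycle; the struct and function names are hypothetical.

```c
/* Per-pass operation counts for the loop body. For the Convert loop
   the slides show one read and one write on the dm bus, one multiply
   and one add. */
struct pass_usage {
    int dm_accesses;
    int multiplies;
    int adds;
};

/* Lower bound on cycles per pass: the busiest resource sets the limit,
   under the assumption of one operation per resource per cycle. */
int min_cycles_per_pass(struct pass_usage u)
{
    int bound = u.dm_accesses;
    if (u.multiplies > bound) bound = u.multiplies;
    if (u.adds > bound) bound = u.adds;
    return bound;
}
```

For the Convert loop the counts are 2 dm accesses, 1 multiply and 1 add, so the bound is 2 cycles per point, set by the dm bus.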

Process for developing parallel code
- Rewrite the “C” code using “LOAD/STORE” techniques
    - Accounts for the SHARC super scalar RISC DSP architecture
- Write the assembly code using a hardware loop
    - Check that end of loop label is in the correct place
- Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    - Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
- Move algorithm to “Resource Usage Chart”
- Optimize parallelism using techniques
    - Attempt to -- watch out for special situations where code will fail
- Compare and contrast time -- setup and loop

Resource 2 -- unroll the loop -- 5 times here
[Resource usage chart]
- Each pass through the loop involves
    - Read
    - Multiply
    - Add
    - Write

Resource Management 3 -- identify resource usage during the decode and writeback stages of each instruction
[Resource usage chart]
- Model used -- depends on where operands are relative to the equals sign
    - ‘Reading’ -- fetching things for ALU/FPU -- like 68K decode phase
    - ‘Writeback’ -- storing results from ALU/FPU
- THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’

Resource Management 4 -- check what can be moved in parallel with other instructions
[Resource usage chart, with annotations]
- OKAY TO MOVE -- F2 src freed up before F2 dest occurs
- OKAY TO MOVE -- empty spot if we can move the * and + instructions which this instruction MUST follow
- F2 = ... -- NO !!! or just possibly NO? Why a problem?

Memory resource availability
- Move up F2 = dm(pt, ZERODM) from the second loop into the first loop
- However, we now have a possible conflict about which F2 should be used for the dm(pt, plus1DM) = F2 instruction if we further optimize by trying to fill the other empty delay slots -- see next slide

Resource management -- overlapping two parts of the loop
[Resource usage chart]

Resource Management 5 -- What’s up, Doc?
[Resource usage chart]
- Attempting to fill all unused resource availability
- Why spend time on simulating the algorithm to see if the problem really exists when there is a simple solution -- use different registers
- Problem may/may not exist with this simple example, but is very likely to exist in a more complex algorithm

Resource 6 -- Solution -- Save and then use F9
[Resource usage chart]

Resource Management 7
[Resource usage chart]
- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop components
- Problem 1 -- no resource in maximum usage -- code inefficient
- Problem 2 -- worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- what if the loop size is 2048 or more?

WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD, as we are no longer indexing correctly through the data.

Resource Management 8 -- unroll the loop a bit more -- 9 loop components
[Resource usage chart]
- DM BUS USAGE NOW MAXed OUT (after a while)
- CODE PATTERN APPEARING

Resource Management 9 -- identify the loop components
[Resource usage chart, marked up in three regions]
- FILL ALU/FPU PIPE
- LOOP BODY
- EMPTY ALU/FPU PIPELINE
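The fill / loop body / empty structure can be modelled in plain C. The sketch below is a simplified two-stage illustration, not the slides' final assembly: each pass of the body finishes point i (add and write) while starting point i + 1 (read and multiply). The function name and the variables `product` and `next_product` are hypothetical; the second variable stands in for the extra register (the F2 to F9 style substitution) that keeps the two overlapping points apart.

```c
/* Two-stage software-pipelined model of the Convert loop. */
void convert_pipelined(float *t, int n)
{
    const float scale = 1.8f;
    const float offset = 32.0f;

    if (n <= 0) {
        return;
    }

    /* FILL: start the first point (read + multiply). */
    float product = t[0] * scale;

    /* LOOP BODY: overlap the end of point i with the start of point i + 1. */
    for (int i = 0; i < n - 1; i++) {
        float next_product = t[i + 1] * scale;  /* start point i + 1 */
        t[i] = product + offset;                /* finish point i    */
        product = next_product;
    }

    /* EMPTY: finish the last point (add + write). */
    t[n - 1] = product + offset;
}
```

The body runs n - 1 times, with the remaining work split between the fill and empty sections; the assembly version on the slides has the same shape, with a loop count that is reduced to account for the points handled outside the loop.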

Resource 9 -- Final code version
[Final code listing: FILL the ALU/FPU pipe, USE it in the loop body (closed by LOOPEND: with DO LOOPEND - 1 UNTIL LCE), then EMPTY the ALU/FPU pipe]

Speed improvements (each total is START + ENTRY + LOOP + EXIT)
- BEFORE: the loop costs 4 cycles per point, so about 4 * N plus the fixed start, entry and exit cycles
- NOW with 2-fold loop unfolding: the loop costs (N - 2) * 5 / 2, so about 2.5 * N plus fixed cycles
- NOW with 3-fold loop unfolding: the loop costs (N - 2) * 6 / 3, so about 2 * N plus fixed cycles
- Factor of 4 / 2.5 with a little effort -- factor of 4 / 2 with more effort
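The speed-up factors can be reproduced with a few lines of arithmetic. This sketch covers the loop term only and makes two assumptions that should be checked against the original slide: the divisors 2 and 3 are inferred from the quoted factors 4 / 2.5 and 4 / 2, and the fixed START, ENTRY and EXIT cycles are left out because their values are not legible in the transcript. All function names are hypothetical.

```c
/* Loop-only cycle estimates for N points (fixed overhead not included). */

/* Basic loop: 4 instructions per point. */
double loop_cycles_before(int n)
{
    return 4.0 * n;
}

/* 2-fold unfolding: 5 cycles per 2 points, 2 points handled outside the loop. */
double loop_cycles_2fold(int n)
{
    return (n - 2) * 5.0 / 2.0;
}

/* 3-fold unfolding: 6 cycles per 3 points, 2 points handled outside the loop. */
double loop_cycles_3fold(int n)
{
    return (n - 2) * 6.0 / 3.0;
}
```

For N = 2048 these give 8192, 5115 and 4092 loop cycles, which is close to the 4 : 2.5 : 2 ratio once N is large enough for the fixed cycles to be negligible.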

Question to ask
- We now know the final code. Should we have made the substitution F2 to F9?
    - Who cares -- do it anyway, as it is more likely to be necessary than unnecessary in most algorithms!
    - No real disadvantage, since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
- Will the code work?

Resource 9 -- Final code version
[Final code listing, loop closed by LOOPEND: (DO LOOPEND - 1 UNTIL LCE)]
- N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- Only works if (N - 2) / 3 is an integer.
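The restriction on N can be turned into a guard that a caller checks before using the unrolled routine. This is a minimal sketch; the function name is hypothetical, and the extra requirement N >= 2 is an assumption added here because two points are handled outside the loop.

```c
#include <stdbool.h>

/* True when the 3-fold unrolled routine can process exactly n points:
   (n - 2) must be a non-negative multiple of 3. */
bool unrolled_loop_handles(int n)
{
    return n >= 2 && (n - 2) % 3 == 0;
}
```

Of the values listed on the slide, only N = 2, 5, 8, 11 and 14 pass this check; any other N needs either the basic loop or extra clean-up code for the leftover points.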

Tackled today
- What’s the problem?
- Standard code development of “C”-code
- Process for “Code with parallel instruction”
    - Rewrite with specialized resources
    - Move to “resource chart”
    - Unroll the loop
    - Adjust code
    - Reroll the loop
- Check if worth the effort
- To come -- Tutorial practice of parallel coding
- To come -- Optimum FIR filter with parallelism