Published by Andreas Axelsson. Modified over 6 years ago.
1
Parallel instructions on a DSP processor. What’s allowed, what’s not and why not?
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada ucalgary.ca
2
Overview
You have all these wonderful resources to run in parallel in the latest DSP architectures.
More parallel -- means more speed?
Must also take into account the physical limitations in any system.
1/1/2019 ENCM Allowed parallel instructions on DSP processors Copyright
3
To be tackled
Limitations of instruction sets -- Why needed?
  CISC processor example
Recognizing possible limitations in the instruction set of SHARC processor
  Standard operations
  Memory accesses -- parallel and non-parallel
  Parallel COMPUTE instructions
  Parallel COMPUTE instructions with multiple memory accesses
4
Why are there instruction limitations?
Data bus size
  For example 32-bit long words, single 16-bit bus for fetching data and instructions
  Immediate implication -- 16 bits easy to fetch, 32 bits slower as they need multiple fetches.
  Immediate implication -- Faster instruction execution if all necessary information can be described within the first 16 bits fetched
    Try to arrange for the most commonly used instructions to be described within a 16-bit opcode.
  Immediate implication -- Conflicts with instruction fetches when lots of data have to be fetched
Speed of fetches from memory
Where to obtain efficiency? Real or imagined
5
68K Branch instruction efficiency -- BRA.S
Want to work in 1 FETCH -- 16 bits available to describe all aspects of the instruction’s operation
  4 bits taken up to say THIS IS A BRANCH INSTRUCTION and not something else
  4 bits taken up for 16 types of tests possible on this CPU
  This means ONLY 8 bits left to describe displacement
    Jump location = PC + 127 / PC - 128
    Displacement is automatically sign extended to 32 bits
If want to branch to an instruction further away than this from the current PC, then need a slower instruction with fetches of 16-bit opcode + 16-bit displacement
BRA.S LOCATION --> ADD.L #(LOCATION32 - CURRENTPC8), PC
0 1 1 0 C C C C P P P P P P P P
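The field split described above can be sketched in Python. This is only an illustration of the 4 + 4 + 8 bit layout; the function names are hypothetical, and the sketch ignores the special displacement values that real 68K parts reserve to select the longer branch forms.

```python
def decode_short_branch(opcode: int):
    """Split a 16-bit short-branch opcode into (condition, displacement)."""
    if (opcode >> 12) != 0b0110:           # top 4 bits: "this is a branch"
        raise ValueError("not a branch opcode")
    condition = (opcode >> 8) & 0xF        # next 4 bits: one of 16 tests
    disp8 = opcode & 0xFF                  # last 8 bits: displacement
    # Sign-extend the 8-bit displacement, as the CPU does to 32 bits.
    displacement = disp8 - 0x100 if disp8 & 0x80 else disp8
    return condition, displacement


def fits_short_branch(displacement: int) -> bool:
    """True if the displacement fits the 8-bit signed field."""
    return -128 <= displacement <= 127
```

For example, `decode_short_branch(0x60FE)` yields condition 0 with displacement -2, and `fits_short_branch(128)` is false, which is the case where the slower two-fetch form is needed.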
6
68k Efficient Move -- MOVEQ.L
Want instruction to work in 1 FETCH -- 16 bits available to describe all aspects of operation
  5 bits taken up to say MOVEQ instruction and not something else
  3 bits taken up for 8 possible data registers
  8 bits left to specify the value
    Value = +127 to -128
    Value is automatically sign extended to 32 bits
MOVEQ.L #64, D0
0 1 1 1 D D D 0 P P P P P P P P
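As a rough sketch of how the 5 + 3 + 8 bit budget packs into one word, the following Python assembles a MOVEQ opcode and mimics the sign extension. The helper names are hypothetical and chosen only for this illustration.

```python
def encode_moveq(value: int, dreg: int) -> int:
    """Pack MOVEQ #value, Dn into a single 16-bit opcode."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in the 8-bit field")
    if not 0 <= dreg <= 7:
        raise ValueError("only 8 data registers can be named in 3 bits")
    # 0111 | DDD | 0 | 8-bit value (two's complement)
    return (0b0111 << 12) | (dreg << 9) | (value & 0xFF)


def sign_extend_8_to_32(byte: int) -> int:
    """Extend an 8-bit field to the 32-bit register value."""
    return byte | 0xFFFFFF00 if byte & 0x80 else byte
```

With these definitions, `encode_moveq(64, 0)` gives `0x7040` for the slide's MOVEQ.L #64, D0.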
7
68k -- Efficient Add -- ADDQ.L
Want to work in 1 FETCH -- 16 bits available to describe operation
  5 bits taken up to say I’M AN ADDQ instruction and 2 to say -- byte/word/long operation
  6 bits taken up for describing Effective Address (addressing modes), e.g. ADDQ.L #6, (A0)+
  3 bits left to describe the value
    Value = 1 to 8 -- why not 0? Why not -128 to +127?
    Use SUBQ for negative values
Single fetch (4 cycles) -- may take up to 28 cycles to complete
0 1 0 1 Q Q Q 0 S S E E E E E E
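One way to look at the "why not 0?" question is to pack the fields by hand. In the Python sketch below (names are hypothetical), the 3-bit field holds 1 to 7 directly and the otherwise useless pattern 000 stands for 8, since adding 0 would do nothing.

```python
SIZE_CODES = {"B": 0b00, "W": 0b01, "L": 0b10}  # 2 bits for byte/word/long


def encode_addq(value: int, size: str, ea: int) -> int:
    """Pack ADDQ.<size> #value, <ea> into a single 16-bit opcode."""
    if not 1 <= value <= 8:
        raise ValueError("only 1..8 can be described in the 3-bit field")
    if not 0 <= ea <= 0x3F:
        raise ValueError("effective address field is 6 bits")
    data = value & 0b111                  # 8 wraps round to the pattern 000
    # 0101 | QQQ | 0 | SS | EEEEEE
    return (0b0101 << 12) | (data << 9) | (SIZE_CODES[size] << 6) | ea
```

Here `encode_addq(6, "L", 0b011000)` corresponds to ADDQ.L #6, (A0)+, assuming the usual mode/register split of the 6-bit EA field (mode 011, register 000).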
8
Add binary coded decimal -- ABCD.B
Want to work in 1 FETCH -- 16 bits available to describe operation -- very limited capability
  Only works on Bytes (effectively op-code is 18 bits long as most other instructions use 2-bits to distinguish .B, .W and .L)
  9 bits taken up to say I’M AN ABCD instruction
  6 bits taken up for two of 8 possible data registers (or 1 data + 1 predecrement addressing operation)
  1 bit left to select which operation is to be performed
    0 means data register to data register
    1 means memory access (predec) to register -- Need PREDECREMENT as must work from right to left for BCD values
1 1 0 0 R R R 1 0 0 0 0 M d d d
9
Maximum Efficiency for most used instruction
68K designers decided that the MOVE instruction was the most useful and gave it the most flexibility
Many different types of MOVE operations possible
  S = 3 sizes of data moved -- .B, .W and .L
  R = 8 destination registers
  M = 8 different destination EA
  E = 64 different source EA (more than expected)
Many EAs require additional operations to fetch the complete opcode information (during decode in ENCM415 model of the phases of CPU operation)
0 0 S S R R R M M M e e e e e e
10
Possible instruction formats -- 8-bit opcode
Bits to distinguish between instructions
  Instr ??????? 7 bits for other info
  Instr ?????? 6 bits for other info
  Instr ????? 5 bits for other info and other instructions possible
OR
  Instr ?????? 6 bits for other info
  No other instructions available
68k like “first format” -- 21k like “second”
11
Same basic issues on SHARC
You can’t do everything with all possible resources
  Note -- you don’t need to “do everything”
  Decide what you want to do best and then customize the opcode to handle that operation
High speed DSP processor applications
  48 bit program data bus
  Need ? bits to describe the type of operation
  Need ? bits to describe memory operations
  Need ? bits to describe ALU/FPU operations
Look at data book for information
12
Compute/dreg<->DM/dreg<->PM
3 bits opcode -- see User Manual for more information
2 bits for direction of memory op -- READ/WRITE on two busses.
  Query -- how to handle “No memory ops needed”? -- Different OPCODE?
12 bits ONLY to describe 2 index registers and 2 modify registers
  DM_I, DM_M, PM_I, PM_M
  Many opcode bits are saved since order of registers in opcode is preset.
8 bits to describe which data registers used
  16 possible registers used in DM and PM
23 bits to describe Compute operations
  Means many things can’t be done with certain legal DM ops
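A quick arithmetic check of the budget above, written as a small Python sketch. The field names are just labels for the slide's numbers; the 48-bit width comes from the program data bus mentioned on the previous slide.

```python
INSTRUCTION_WIDTH = 48  # bits on the program data bus

# Bit counts as listed on the slide.
FIELDS = {
    "opcode": 3,
    "memory direction": 2,
    "index/modify registers": 12,
    "data registers": 8,
    "compute": 23,
}


def bits_used(fields: dict) -> int:
    """Total number of bits the listed fields occupy."""
    return sum(fields.values())


def bits_per_dag_register(total_bits: int = 12, registers: int = 4) -> int:
    """Bits available to name each of DM_I, DM_M, PM_I, PM_M."""
    return total_bits // registers
```

The fields add up to exactly 48, so nothing is left over, and each index or modify register gets only 3 bits, which can select one register out of 8 rather than any register in the machine.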
13
Restriction -- MANY
Must be a DAG1 index register and a DAG2 index register -- NOT 2 DAG1 index registers
  Would need extra opcode bits, extra internal data paths
Must be POST-MODIFY with register -- NOT premodify -- NOT constant modify
  Describing post/pre modify constants would take many bits and not permit parallel operations to be described.
Host of restrictions on COMPUTE, especially if want to use multi-function operations
  Dual Add/Subtract
  Parallel multiplier/ALU
  Parallel multiplier with add/subtract
14
Absolute data ops -- write Memory
Dm(<addr>) = ureg -- ureg is any register
Bits required
  Address constant bits
  Which of 256 universal registers -- 8 bits
  Which direction -- 1 bit
  Whether dm or pm data movement -- 1 bit
  Opcode itself -- ? bits
Can’t parallel with ALU/FP or other memory movements -- not enough bits to go around
15
Compute operations Only 23 bits available
Requires 1 destination and 2 sources
Limited to DATA registers only
  16 reg = 4 bits * 3 needed
  ONLY work on data registers as there is not enough room to describe all uregs -- 8 bits * 3 needed
R1 = R2 + R3 allowed
R2 = R3 + 2 NOT ALLOWED -- 32 bit constant
I1 = I2 + I3 NOT ALLOWED -- 64 reg = 6 bits * 3
Can be made conditional and also combined with UREG to UREG moves (but not all MEM <- UREG)
Can be ………...
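The register arithmetic on this slide can be checked with a few lines of Python. This is a sketch only: the function names are hypothetical, and it counts register-naming bits without modelling the bits needed to say which operation is performed.

```python
COMPUTE_FIELD_BITS = 23


def register_bits(register_count: int, operands: int = 3) -> int:
    """Bits needed to name `operands` registers out of `register_count`."""
    bits_each = (register_count - 1).bit_length()
    return bits_each * operands


def registers_fit(register_count: int) -> bool:
    """True if naming the registers alone fits inside the compute field."""
    return register_bits(register_count) <= COMPUTE_FIELD_BITS
```

Sixteen data registers need 12 bits and leave 11 for describing the operation; 256 universal registers would need 24 bits, which already exceeds the 23-bit field before any operation is described.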
16
Pipeline considerations REAL ISSUE
R2 = R1 + R3, R3 = dm(I2, M2), pm(I8,M9) = R2;
  The value in R2 in R2 = R1 + R3 is not the value of R2 in pm(I8,M9) = R2
  The value in R3 in R2 = R1 + R3 is not the value of R3 in R3 = dm(I2, M2)
You can do R3 = dm(I2, M2), pm(I8,M9) = R2
but you can’t do R3 = dm(I2, M2), dm(I3,M3) = R2, even though it looks like the data bus is free for accesses at the beginning and end of the cycle, because it ain’t.
  Memory accesses take the WHOLE cycle to complete.
  Other processors use 2 cycles to complete the equivalent.
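One way to read the first two statements is that every source is read with its old value before any destination is written. The Python sketch below models only that reading; it is an assumption made for illustration, not a description of the actual SHARC pipeline, and the function name is hypothetical.

```python
def execute_parallel(regs: dict, dm_value: int):
    """Model R2 = R1 + R3, R3 = dm(I2, M2), pm(I8, M9) = R2 in one cycle.

    Returns the new register file and the value written to program memory.
    """
    old = dict(regs)                     # snapshot: all sources use old values
    new = dict(regs)
    new["R2"] = old["R1"] + old["R3"]    # uses R3 from before the dm read
    new["R3"] = dm_value                 # arrives at the end of the cycle
    stored_to_pm = old["R2"]             # R2 from before the add
    return new, stored_to_pm
```

Starting from R1 = 1, R2 = 10, R3 = 100 with 7 read from data memory, this model leaves R2 = 101 and R3 = 7, while the value stored to program memory is the old R2, 10.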
17
More complex instructions More associated limitations
RULES -- things that are always allowed -- Sunday to Saturday
EXCEPTIONS -- things that are often allowed -- Monday, Wednesday and Friday
EXCEPTIONS to EXCEPTIONS -- things that are occasionally allowed -- Monday 8:00 till 9:00 (alternate weeks)
18
When are DSP instructions valid?
You are going to customize
  Most instructions always valid -- From Monday to Friday
  Some only between 9:00 am and 9:00 pm
Check against architecture -- data paths present?
21k parallelism -- Must be able to pass following checks
  Can it be fetched in one cycle (opcode limited in size)
    If not using a constant (32-bit), or too many constants
  Each resource in use only once during each instruction (dm, pm, * and +)
Then probably legal, BUT the designers had the final decision and you have to live with it!
Get a process to avoid making the same mistake twice
19
21k Processor architecture
20
DAG generator architecture
Animation from SHARCNavigator on diskette from office
Also see DSP workshop book and exercises
21
Examples
pm(I12, M12) = I6; allowed
dm(I4, M4) = I11; allowed
pm(I12, M12) = I11; not allowed as there is not a data path to permit saving of DAGx registers by a DAGx operation. Need to move to a data register and then save that.
I11 = pm(I12, M12); allowed -- because there is a path
R2 = pm(I11, M12); BUT some conflict about immediately using DAG2 registers in next instruction. HIDDEN extra cycle before DAG2 registers can be used again.
I11 = pm(I12,M12); INSTR not involving DAG2; R2 = pm(I11,M12) -- but the cycle time could be used with a !DAG2 operation
dm(I4, M4) = R11, pm(I12, M12) = R6; allowed
dm(I4, M4) = R11, pm(I12, M12) = I6; not allowed
22
When are DSP instructions valid?
You are going to customize
  Most instructions always valid -- From Monday to Friday
  Some only between 9:00 am and 9:00 pm
Check against architecture
21k parallelism -- Must be able to pass following checks
  Can it be fetched in one cycle (opcode limited in size)
    If not using a constant (32-bit), or too many constants
  Each resource in use only once during each instruction (dm, pm, * and +)
Then probably legal, provided ordered correctly on the line
  R4 = R1 * R5, R8 = R9 + R12;
  R8 = R9 + R12, R4 = R1 * R5;
The designers had the final decision and you have to live by it!
Get a process to avoid making the same mistake twice
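The checklist above could be turned into a rough first-pass filter along the following lines. This Python sketch is hypothetical: the representation of an instruction as (resource, uses_constant) pairs and the rule that a 32-bit constant cannot share the instruction with anything else are simplifying assumptions, and a "True" here still only means "probably legal".

```python
from collections import Counter

RESOURCES = {"dm", "pm", "*", "+"}


def probably_legal(parts) -> bool:
    """parts: list of (resource, uses_32bit_constant) pairs, one per clause."""
    if any(resource not in RESOURCES for resource, _ in parts):
        raise ValueError("unknown resource")
    # Each resource may be used only once in a single instruction.
    usage = Counter(resource for resource, _ in parts)
    if any(count > 1 for count in usage.values()):
        return False
    # Assumption: a 32-bit constant leaves no opcode bits for parallel clauses.
    if len(parts) > 1 and any(uses_constant for _, uses_constant in parts):
        return False
    return True
```

For instance, `probably_legal([("dm", False), ("pm", False)])` passes, while two "dm" clauses, or two clauses that each carry an address constant, are rejected. Instructions that pass still have to be checked against the manual.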
23
Under best conditions If instruction described the right way
1 data memory access (in or out) to data registers (post mod)
1 program memory access (in or out) to data registers (post mod), PROVIDED that the instruction being fetched (N+2) is stored in the instruction cache to avoid bus clashes
1 compute operation on data registers (except for certain multi-function instructions with specific registers)
A Modify instruction can be used to perform limited ALU operations on index registers (only one modify allowed -- unless you do garbage memory reads to cause the modification of index registers)
24
What can be fetched in 1 cycle? BASICALLY is the opcode big enough?
Rx = dm(0x24000), Ry = dm(0x26000);
Rx = dm(0x26000), Ry = pm(0x23000);
dm(0x24000) = Rx, Ry = dm(0x26000);
dm(0x26000) = Rx, pm(0x23000) = Ry;
Rx = dm(I1, M1), dm(I2, M2) = Ry;
dm(I1, M1) = Rx, pm(I12, M12) = Iy;
dm(I1, M1) = Rx, Ry = pm(I12, M12);
R4 = R2+R3, dm(I1, M1) = Rx, Ry = pm(I12, M12);
R4 = R8+R12, R5 = R8-R12, dm(I1, M1) = Rx, Ry = pm(I12, M12);
R4 = R2+R3, R5 = R6-R7, dm(I1, M1) = Rx, Ry = pm(I12, M12);
Slide annotations, in order:
  Two DM accesses
  DM/PM -- 2 constants
  Two DM accesses
  DM/PM -- Not DATA reg
  Looks okay -- BUT!
  Looks okay -- BUT NO!
25
In principle -- ALWAYS LEGAL -- will assemble
dm(I1, M1) = Rx, Ry = pm(I12, M12);
However not always 1 cycle in execution, even if no clash occurs with fetching other instructions
  SHARC internal memory bank divided into 2 blocks
  Both blocks can be accessed by both DAGs
  However can only get parallel operations if parallel operations involve different memory blocks
I1 = 0x23000, I12 = 0x cycle execution
I1 = 0x26010, I12 = 0x cycle execution
Check in USER MANUAL for actual memory block values
Note that complication of “data size” also present
26
Does it meet the COMPUTE (23-bit) restrictions?
Must involve DATA registers only
Can only involve a constant if that constant is 1 or CI (Carry In = 1 or 0), the result of multiple precision integer operations -- CI IS NOT CONSTANT INTEGER
R2 = R3 + R4 legal
  but not R2 = R3 + I4 -- I4 is a “UREG” (needs more bits)
  not R2 = R -- Can’t handle 32 bits in COMPUTE
  not I2 = I (but Modify(I2, 4) OK as memory op)
R2 = R3 + R4 is NOT ALWAYS legal (if combined with something else)
  F6 = F7 * F9, F2 = F3 + F4; is illegal
27
Special rules -- IF you want adds and multiplys in a parallel instruction -- more later
F1 = F2 * F3, F4 = F5 + F6;
  Want to do as a single instruction
  Not enough bits in the opcode
    Register description (24 - bits when COMPUTE FIELD is 23 bits)
    Plus bits for describing math operations, conditions and memory ops?
Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
  Must rearrange register usage with program code for this to be possible
  Register description (bits) -- other bits “understood”
  Inconvenient rather than limiting
e.g. F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
Not accepted  F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
Not accepted  F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
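The register pattern above lends itself to a simple pre-check before assembling. The Python sketch below tests only the source-register groups given on the slide (F0-F3 times F4-F7, F8-F11 plus F12-F15); the function names are hypothetical, and clause ordering and destination registers are not modelled.

```python
def multifunction_sources_ok(mul_x: int, mul_y: int,
                             alu_x: int, alu_y: int) -> bool:
    """Check source register numbers for a parallel multiply and add."""
    return (
        mul_x in range(0, 4)         # first multiplier input: F0..F3
        and mul_y in range(4, 8)     # second multiplier input: F4..F7
        and alu_x in range(8, 12)    # first ALU input: F8..F11
        and alu_y in range(12, 16)   # second ALU input: F12..F15
    )


def source_field_bits(sources: int = 4, choices: int = 4) -> int:
    """Bits needed when each source picks one register from a group of 4."""
    return sources * (choices - 1).bit_length()
```

With each source limited to a group of four, the four sources need 8 bits in total instead of 16. `multifunction_sources_ok(0, 4, 8, 12)` passes, while the swapped form F4 * F0, or the original F2 * F3 with F5 + F6, does not.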
28
Tackled today
Limitations of instruction sets -- why needed
  CISC processor example
Recognizing possible limitations in the instruction set of SHARC processor
  Standard operations
  Memory accesses -- parallel and non-parallel
  Parallel COMPUTE instructions
  Parallel COMPUTE instructions with multiple memory accesses