
Presentation transcript:

Parallel instructions on a DSP processor: what's allowed, what's not, and why not?
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
smithmr@ucalgary.ca

Overview
The latest DSP architectures give you many wonderful resources to run in parallel.
Does more parallelism mean more speed?
You must also take into account the physical limitations of any system.
1/15/2019 ENCM515 -- Allowed parallel instructions on DSP processors Copyright smithmr@ucalgary.ca

To be tackled
Limitations of instruction sets -- why are they needed?
  CISC processor example
Recognizing possible limitations in the instruction set of the SHARC processor
  Standard operations
  Memory accesses -- parallel and non-parallel
  Parallel COMPUTE instructions
  Parallel COMPUTE instructions with multiple memory accesses

Why are there instruction limitations?
Data bus size
  For example, the 68332 has 32-bit long words but a single 16-bit bus for fetching both data and instructions.
  Immediate implication -- 16 bits are easy to fetch; 32 bits are slower because they need multiple fetches.
  Immediate implication -- an instruction executes faster if all the necessary information can be described within the first 16 bits fetched. Try to arrange for the most commonly used instructions to be described within a 16-bit opcode.
  Immediate implication -- data fetches conflict with instruction fetches when a lot of data has to be fetched.
Speed of fetches from memory
Where to obtain efficiency? Real or imagined

68k branch instruction efficiency -- BRA.S
Want the instruction to work in 1 FETCH -- 16 bits are available to describe all aspects of the instruction's operation.
  4 bits are taken up to say THIS IS A BRANCH INSTRUCTION and not something else.
  4 bits are taken up for the 16 types of tests possible on this CPU.
  This means ONLY 8 bits are left to describe the displacement.
    Jump location = PC + 127 to PC - 128
    The displacement is automatically sign extended to 32 bits.
If you want to branch to an instruction further than +-128 from the current PC, you need a slower instruction that fetches a 16-bit opcode + a 16-bit displacement.
BRA.S LOCATION --> ADD.L #(LOCATION32 - CURRENTPC8), PC
Bit layout: 0 1 1 0 C C C C P P P P P P P P
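
The field split and sign extension described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `sign_extend` and `split_short_branch` are made up for this sketch, and it models just the 4 + 4 + 8 bit split from the slide, not the complete 68k branch rules.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1          # keep only the field
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def split_short_branch(opcode: int) -> tuple:
    """Split a 16-bit word into (4-bit type, 4-bit condition, signed 8-bit displacement)."""
    kind = (opcode >> 12) & 0xF        # "this is a branch instruction"
    condition = (opcode >> 8) & 0xF    # one of 16 possible tests
    displacement = sign_extend(opcode & 0xFF, 8)
    return kind, condition, displacement


# 0x60FE is the well-known "branch to self" word: type 0110, condition 0000, displacement -2
print(split_short_branch(0x60FE))
```

With only 8 bits in the displacement field, `sign_extend` can return nothing outside -128 to +127, which is why a longer branch needs an extra fetch.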

68k efficient move -- MOVEQ.L
Want the instruction to work in 1 FETCH -- 16 bits are available to describe all aspects of the operation.
  5 bits are taken up to say MOVEQ instruction and not something else.
  3 bits are taken up for the 8 possible data registers.
  8 bits are left to specify the value.
    Value = +127 to -128
    The value is automatically sign extended to 32 bits.
MOVEQ.L #64, D0
Bit layout: 0 1 1 1 D D D 0 P P P P P P P P

68k efficient add -- ADDQ.L
Want the instruction to work in 1 FETCH -- 16 bits are available to describe the operation.
  5 bits are taken up to say I'M AN ADDQ instruction, and 2 bits to say byte/word/long operation.
  6 bits are taken up for describing the Effective Address (addressing modes), e.g. ADDQ.L #6, (A0)+
  3 bits are left to describe the value.
    Value = 1 to 8 -- why not 0? Why not -128 to +127?
    Use SUBQ for negative values.
Single fetch (4 cycles) -- but the instruction may take up to 28 cycles to complete.
Bit layout: 0 1 0 1 Q Q Q 0 S S E E E E E E
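
One way to see why the value runs from 1 to 8 rather than 0 to 7: adding 0 does nothing useful, so the all-zero pattern of the 3-bit field is reused to mean 8. A minimal Python sketch of that mapping follows; the helper names `encode_quick` and `decode_quick` are invented for illustration.

```python
def encode_quick(value: int) -> int:
    """Pack an ADDQ-style immediate (1..8) into a 3-bit field; 8 is stored as 000."""
    if not 1 <= value <= 8:
        raise ValueError("quick immediate must be in the range 1..8")
    return value & 0b111


def decode_quick(field: int) -> int:
    """Recover the immediate from the 3-bit field; 000 means 8, never 0."""
    field &= 0b111
    return field if field != 0 else 8


for value in range(1, 9):
    print(value, format(encode_quick(value), "03b"))
```

Negative values are not representable at all in this field, which is why a separate SUBQ instruction exists.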

Add binary coded decimal -- ABCD.B
Want the instruction to work in 1 FETCH -- 16 bits are available to describe the operation -- very limited capability.
  Only works on bytes (effectively the opcode is 18 bits long, as most other instructions use 2 bits to distinguish .B, .W and .L).
  9 bits are taken up to say I'M AN ABCD instruction.
  6 bits are taken up for two 3-bit register fields (two of the 8 data registers, or two address registers used with predecrement addressing).
  1 bit is left to select which operation is to be performed.
    0 means data register to data register.
    1 means memory to memory, with both operands addressed by predecrement -- PREDECREMENT is needed because BCD values must be worked on from right to left.
Bit layout: 1 1 0 0 R R R 1 0 0 0 0 M d d d

Maximum efficiency for the most used instruction
The 68k designers decided that the MOVE instruction was the most useful and gave it the most flexibility.
Many different types of MOVE operations are possible:
  S = 3 sizes of data moved -- .B, .W and .L
  R = 8 destination registers
  M = 8 different destination EA
  e = 64 different source EA (more than expected)
Many EAs require additional operations to fetch the complete opcode information (during decode in the ENCM415 model of the phases of CPU operation).
Bit layout: 0 0 S S R R R M M M e e e e e e

Possible instruction formats -- 8-bit opcode
Bits to distinguish between instructions:
  Instr 1   0 ???????     7 bits for other info
  Instr 2   1 0 ??????    6 bits for other info
  Instr 3   1 1 0 ?????   5 bits for other info, and other instructions possible
OR
  Instr 3   1 1 ??????    6 bits for other info, no other instructions available
The 68k is like the "first format" -- the 21k is like the "second".
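
The trade-off between the two formats can be made concrete with a small decoder for each. This is a sketch of the 8-bit toy formats on this slide only; the function names are hypothetical, and neither the 68k nor the 21k uses these exact encodings.

```python
def decode_first_format(byte: int) -> tuple:
    """Prefixes 0, 10, 110; the 111 prefix stays free for further instructions."""
    if (byte & 0x80) == 0:
        return ("Instr 1", byte & 0x7F)   # 7 bits of other info
    if (byte & 0x40) == 0:
        return ("Instr 2", byte & 0x3F)   # 6 bits of other info
    if (byte & 0x20) == 0:
        return ("Instr 3", byte & 0x1F)   # 5 bits of other info
    return ("other", byte & 0x1F)         # room for more instructions


def decode_second_format(byte: int) -> tuple:
    """Prefixes 0, 10, 11; Instr 3 gains a bit but no opcode space is left."""
    if (byte & 0x80) == 0:
        return ("Instr 1", byte & 0x7F)
    if (byte & 0x40) == 0:
        return ("Instr 2", byte & 0x3F)
    return ("Instr 3", byte & 0x3F)       # 6 bits of other info


print(decode_first_format(0b11010101))    # ('Instr 3', 21)
print(decode_second_format(0b11010101))   # ('Instr 3', 21)
print(decode_first_format(0b11110101))    # ('other', 21)
print(decode_second_format(0b11110101))   # ('Instr 3', 53)
```

The same byte `0b11110101` is a different instruction in the first format but simply a larger operand for Instr 3 in the second, which is the choice between more instructions and more information per instruction.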

Same basic issues on SHARC
You can't do everything with all possible resources.
  Note -- you don't need to "do everything".
  Decide what you want to do best and then customize the opcode to handle that operation.
  High speed DSP processor applications
48-bit program data bus
  Need ? bits to describe the type of operation
  Need ? bits to describe memory operations
  Need ? bits to describe ALU/FPU operations
Look at the data book for information.

Compute/dreg<->DM/dreg<->PM
3 bits of opcode, 001 -- see the User Manual for more information.
2 bits for the direction of the memory op, READ/WRITE on the two busses.
  Query -- how to handle "no memory ops needed"? A different OPCODE?
ONLY 12 bits to describe 2 index registers and 2 modify registers: DM_I, DM_M, PM_I, PM_M.
  Many opcode bits are saved since the order of the registers in the opcode is preset.
8 bits to describe which data registers are used.
  16 possible registers used in DM and PM.
23 bits to describe the Compute operation.
This means many things can't be done with certain legal DM ops.
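
As a quick check, the field widths listed on this slide add up to exactly the 48-bit instruction word. The sketch below uses only the numbers from the slide; the dictionary keys are descriptive labels chosen for this example, not official field names.

```python
# Field widths as listed on the slide (labels are informal, not manual terminology)
FIELD_BITS = {
    "opcode": 3,
    "memory direction (DM and PM)": 2,
    "index and modify registers (DM_I, DM_M, PM_I, PM_M)": 12,
    "data registers (one per bus)": 8,
    "compute": 23,
}

total = sum(FIELD_BITS.values())
print(total)                      # 48

# 12 bits shared by 4 address registers leaves 3 bits each, so 8 choices per register
bits_per_address_register = 12 // 4
print(2 ** bits_per_address_register)   # 8

# 8 bits shared by 2 data registers leaves 4 bits each, so 16 choices per register
bits_per_data_register = 8 // 2
print(2 ** bits_per_data_register)      # 16
```

Three bits per index register can only select one of 8 registers, which fits the requirement that one index register comes from DAG1 and the other from DAG2.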

Restrictions -- MANY
Must be a DAG1 index register and a DAG2 index register -- NOT 2 DAG1 index registers.
  That would need extra opcode bits and extra internal data paths.
Must be POST-MODIFY with a register -- NOT premodify, NOT constant modify.
  Describing post/pre modify constants would take many bits and not permit parallel operations to be described.
Host of restrictions on COMPUTE, especially if you want to use multi-function operations:
  Dual add/subtract
  Parallel multiplier/ALU
  Parallel multiplier with add/subtract

Absolute data ops -- write memory
dm(<addr>) = ureg -- ureg is any register
Bits required:
  Address constant -- 32 bits
  Which of 256 universal registers -- 8 bits
  Which direction -- 1 bit
  Whether dm or pm data movement -- 1 bit
  Opcode itself -- ? bits
Can't be done in parallel with ALU/FP or other memory movements -- not enough bits to go around.

Compute operations
Only 23 bits available.
Requires 1 destination and 2 sources.
  Limited to DATA registers only -- 16 reg = 4 bits * 3 needed.
  ONLY works on data registers as there is not enough room to describe all uregs -- 8 bits * 3 needed.
R1 = R2 + R3    allowed
R2 = R3 + 2     NOT ALLOWED -- 32-bit constant
I1 = I2 + I3    NOT ALLOWED -- 64 reg = 6 bits * 3
Can be made conditional and also combined with UREG to UREG moves (but not all MEM <- UREG).
Can be ………...
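
The operand arithmetic on this slide can be reproduced with a short calculation. This is a sketch using the register counts quoted above (16 data registers, 64 registers in the `I1 = I2 + I3` case, 256 universal registers); `operand_bits` is a name invented here.

```python
COMPUTE_FIELD_BITS = 23


def operand_bits(register_count: int, operands: int = 3) -> int:
    """Bits needed to name `operands` registers chosen from `register_count` registers."""
    bits_per_register = (register_count - 1).bit_length()
    return operands * bits_per_register


for count in (16, 64, 256):
    needed = operand_bits(count)
    left_over = COMPUTE_FIELD_BITS - needed
    print(count, "registers:", needed, "operand bits,", left_over, "bits left for the operation")
```

With 256 universal registers the three operands alone would need 24 bits, one more than the whole compute field, and even 64 registers would leave only 5 bits to say which operation is wanted.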

Pipeline considerations -- REAL ISSUE
R2 = R1 + R3, R3 = dm(I2, M2), pm(I8, M9) = R2;
  The value of R2 produced by R2 = R1 + R3 is not the value of R2 stored by pm(I8, M9) = R2.
  The value of R3 used in R2 = R1 + R3 is not the value of R3 loaded by R3 = dm(I2, M2).
---------------------------------------------------------------
You can do
  R3 = dm(I2, M2), pm(I8, M9) = R2;
but you can't do
  R3 = dm(I2, M2), dm(I3, M3) = R2;
even though it looks as if the data bus is free for accesses at the beginning and end of the cycle, because it isn't. Memory accesses take the WHOLE cycle to complete. Other processors use 2 cycles to complete the equivalent.
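
The "old value versus new value" behaviour of the first instruction can be modelled by reading every source before writing any destination. The Python below is only a behavioural sketch of that one instruction line; the function `run_parallel` and the dictionary-based register file are assumptions made for illustration, not a model of the real pipeline.

```python
def run_parallel(regs: dict, dm_value: int) -> tuple:
    """Model  R2 = R1 + R3, R3 = dm(I2, M2), pm(I8, M9) = R2;

    All sources are sampled from the register values at the start of the
    instruction; all destinations are updated together at the end.
    """
    old = dict(regs)                      # snapshot taken before anything is written
    new = dict(regs)
    new["R2"] = old["R1"] + old["R3"]     # uses the OLD R3, not the value being loaded
    new["R3"] = dm_value                  # the load lands at the end of the cycle
    pm_written = old["R2"]                # the store sends out the OLD R2, not the new sum
    return new, pm_written


registers = {"R1": 1, "R2": 10, "R3": 100}
after, stored = run_parallel(registers, dm_value=7)
print(after)    # {'R1': 1, 'R2': 101, 'R3': 7}
print(stored)   # 10
```

Reading the line left to right as three sequential statements would predict a stored value of 101, which is the mistake the slide is warning about.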

More complex instructions -- more associated limitations
RULES -- things that are always allowed -- Sunday to Saturday
EXCEPTIONS -- things that are often allowed -- Monday, Wednesday and Friday
EXCEPTIONS to EXCEPTIONS -- things that are occasionally allowed -- Monday 8:00 till 9:00 (alternate weeks)

When are DSP instructions valid?
You are going to customize.
  Most instructions are always valid -- from Monday to Friday.
  Some only between 9:00 am and 9:00 pm.
Check against the architecture -- are the data paths present?
21k parallelism -- must be able to pass the following checks:
  Can it be fetched in one cycle (opcode limited in size)?
    Yes, if it does not use a constant (32-bit) or too many constants.
  Is each resource in use only once during each instruction (dm, pm, * and +)?
Then it is probably legal, BUT the designers had the final decision and you have to live with it!
Get a process to avoid making the same mistake twice.
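
The "each resource only once" check lends itself to a small helper. This is a rough sketch that assumes each operation on the line has already been labelled by hand with the resource it uses (`dm`, `pm`, `*` or `+`); `conflicting_resources` is a hypothetical name, and passing this check does not by itself make an instruction legal.

```python
from collections import Counter


def conflicting_resources(operations: list) -> list:
    """Return the resources that appear more than once on one instruction line.

    `operations` is a list of (resource, text) pairs, e.g. ("dm", "R3 = dm(I2, M2)").
    """
    counts = Counter(resource for resource, _text in operations)
    return sorted(resource for resource, n in counts.items() if n > 1)


ok_line = [("dm", "R3 = dm(I2, M2)"), ("pm", "pm(I8, M9) = R2")]
bad_line = [("dm", "R3 = dm(I2, M2)"), ("dm", "dm(I3, M3) = R2")]

print(conflicting_resources(ok_line))    # []
print(conflicting_resources(bad_line))   # ['dm']
```

An empty result only means "probably legal" in the sense of the slide; the opcode-size check and the designers' special cases still apply.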

21k processor architecture

DAG generator architecture
Animation from SHARCNavigator on diskette from office.
Also see the DSP workshop book and exercises.

Examples
pm(I12, M12) = I6;     allowed
dm(I4, M4) = I11;      allowed
pm(I12, M12) = I11;    not allowed, as there is no data path to permit saving of DAGx registers by a DAGx operation. Need to move to a data register and then save that.
I11 = pm(I12, M12);    allowed -- because there is a path
R2 = pm(I11, M12);     BUT there is some conflict about immediately using DAG2 registers in the next instruction. HIDDEN extra cycle before DAG2 registers can be used again.
I11 = pm(I12, M12); INSTR not involving DAG2; R2 = pm(I11, M12);
  But the cycle time could be used with a !DAG2 operation.
dm(I4, M4) = R11, pm(I12, M12) = R6;    allowed
dm(I4, M4) = R11, pm(I12, M12) = I6;    not allowed

When are DSP instructions valid?
You are going to customize.
  Most instructions are always valid -- from Monday to Friday.
  Some only between 9:00 am and 9:00 pm.
Check against the architecture.
21k parallelism -- must be able to pass the following checks:
  Can it be fetched in one cycle (opcode limited in size)?
    Yes, if it does not use a constant (32-bit) or too many constants.
  Is each resource in use only once during each instruction (dm, pm, * and +)?
Then it is probably legal, provided it is ordered correctly on the line:
  R4 = R1 * R5, R8 = R9 + R12;
  R8 = R9 + R12, R4 = R1 * R5;
The designers had the final decision and you have to live by it!
Get a process to avoid making the same mistake twice.

Under best conditions
If the instruction is described the right way:
  1 data memory access (in or out) to data registers (post mod)
  1 program memory access (in or out) to data registers (post mod), PROVIDED that the instruction being fetched (N+2) is stored in the instruction cache to avoid bus clashes
  1 compute operation on data registers (except for certain multi-function instructions with specific registers)
A Modify instruction can be used to perform limited ALU operations on index registers (only one modify allowed -- unless you do garbage memory reads to cause the modification of index registers).

What can be fetched in 1 cycle?
BASICALLY, is the opcode big enough?
  Rx = dm(0x24000), Ry = dm(0x26000);
  Rx = dm(0x26000), Ry = pm(0x23000);
  dm(0x24000) = Rx, Ry = dm(0x26000);
  dm(0x26000) = Rx, pm(0x23000) = Ry;
  Rx = dm(I1, M1), dm(I2, M2) = Ry;
  dm(I1, M1) = Rx, pm(I12, M12) = Iy;
  dm(I1, M1) = Rx, Ry = pm(I12, M12);
  R4 = R2+R3, dm(I1, M1) = Rx, Ry = pm(I12, M12);
  R4 = R8+R12, R5 = R8-R12, dm(I1, M1) = Rx, Ry = pm(I12, M12);
  R4 = R2+R3, R5 = R6-R7, dm(I1, M1) = Rx, Ry = pm(I12, M12);
Slide annotations, in order: Two DM accesses; DM/PM -- 2 constants; Two DM accesses; DM/PM -- not DATA reg; Looks okay -- BUT!; Looks okay -- BUT NO!

In principle -- ALWAYS LEGAL -- will assemble
dm(I1, M1) = Rx, Ry = pm(I12, M12);
However, it does not always execute in 1 cycle, even if no clash occurs with fetching other instructions.
  SHARC internal memory is divided into 2 blocks.
  Both blocks can be accessed by both DAGs.
  However, you only get parallel operation if the two accesses involve different memory blocks.
    I1 = 0x23000, I12 = 0x26000 -- 1 cycle execution
    I1 = 0x26010, I12 = 0x26000 -- 2 cycle execution
Check in the USER MANUAL for actual memory block values.
Note that the complication of "data size" is also present.
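
The block rule can be sketched as a simple cycle estimate. The boundary `BLOCK1_START = 0x24000` below is an assumption picked only so that the two examples on the slide land as described; the real block addresses depend on the processor variant and word size and must be taken from the user manual.

```python
BLOCK1_START = 0x24000   # ASSUMED boundary for illustration; check the user manual


def block_of(address: int) -> int:
    """Return 0 or 1 for the internal memory block that holds `address`."""
    return 0 if address < BLOCK1_START else 1


def access_cycles(dm_address: int, pm_address: int) -> int:
    """Cycles for one dm access plus one pm access issued in the same instruction."""
    same_block = block_of(dm_address) == block_of(pm_address)
    return 2 if same_block else 1


print(access_cycles(0x23000, 0x26000))   # 1 (different blocks)
print(access_cycles(0x26010, 0x26000))   # 2 (same block)
```

The instruction assembles in both cases; only the placement of the arrays in memory decides whether the second access costs an extra cycle.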

Does it meet the COMPUTE (23-bit) restrictions?
Must involve DATA registers only.
Can only involve a constant if that constant is 1 or CI (Carry In = 1 or 0, the result of multiple precision integer operations) -- CI IS NOT CONSTANT INTEGER.
R2 = R3 + R4 is legal, but
  not R2 = R3 + I4 -- I4 is a "UREG" (needs more bits)
  not R2 = R3 + 4 -- can't handle 32 bits in COMPUTE
  not I2 = I2 + 4 -- (but Modify(I2, 4) is OK as a memory op)
R2 = R3 + R4 is NOT ALWAYS legal (if combined with something else).
  F6 = F7 * F9, F2 = F3 + F4; is illegal

Special rules -- IF you want adds and multiplies in a parallel instruction -- more later
F1 = F2 * F3, F4 = F5 + F6;
Want to do this as a single instruction, but there are not enough bits in the opcode.
  Register description 4 + 4 + 4 + 4 + 4 + 4 (24 bits, when the COMPUTE FIELD is 23 bits)
  Plus bits for describing math operations, conditions and memory ops?
Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
  Must rearrange register usage within the program code for this to be possible.
  Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) -- other bits "understood"
  Inconvenient rather than limiting.
e.g.
  F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
  Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
  Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
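
The register-bank rule for the combined multiply and add can be checked mechanically. The sketch below covers only the source-register ranges quoted on this slide and the 2-bit operand fields they imply; the names `multifunction_sources_ok` and `source_field` are invented here, and the ordering rule on the instruction line is not modelled.

```python
# Allowed source banks from the slide: (first register, last register)
MUL_X, MUL_Y = (0, 3), (4, 7)
ALU_X, ALU_Y = (8, 11), (12, 15)


def in_bank(register: int, bank: tuple) -> bool:
    return bank[0] <= register <= bank[1]


def multifunction_sources_ok(mul_x: int, mul_y: int, alu_x: int, alu_y: int) -> bool:
    """True if  Fn = F(mul_x) * F(mul_y), Fm = F(alu_x) + F(alu_y)  uses the allowed banks."""
    return (in_bank(mul_x, MUL_X) and in_bank(mul_y, MUL_Y)
            and in_bank(alu_x, ALU_X) and in_bank(alu_y, ALU_Y))


def source_field(register: int, bank: tuple) -> int:
    """2-bit field for a source register: its offset inside the bank."""
    return register - bank[0]


print(multifunction_sources_ok(0, 4, 8, 12))   # True:  F0 * F4, F8 + F12
print(multifunction_sources_ok(4, 0, 8, 12))   # False: F4 * F0 has the banks swapped
print(multifunction_sources_ok(2, 3, 5, 6))    # False: F2 * F3, F5 + F6
print(4 + 2 + 2 + 4 + 2 + 2)                   # 16 bits instead of 24
```

Because the bank is implied by the operand's position, each source needs only 2 bits, which brings the register description down from 24 bits to 16 and leaves room in the 23-bit compute field for the operation itself.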

Tackled today
Limitations of instruction sets -- why are they needed?
  CISC processor example
Recognizing possible limitations in the instruction set of the SHARC processor
  Standard operations
  Memory accesses -- parallel and non-parallel
  Parallel COMPUTE instructions
  Parallel COMPUTE instructions with multiple memory accesses