Efficient Loop Handling for DSP algorithms on CISC, RISC and DSP processors

M. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada
smithmr@ucalgary.ca
07/16/96

To be tackled today

- Performing multiple memory accesses to an array
    - Loop overhead can steal many cycles
- Loop overhead -- depends on implementation
    - Standard loop with test at the start -- while ( )
    - Initial test with additional test at end -- do-while ( )
    - Down-counting loops
- Special efficiencies
    - CISC -- hardware
    - RISC -- intelligent compilers
    - DSP -- hardware

1/2/2019 ENCM515 -- High Speed Loops -- Hardware and Software. Copyright smithmr@ucalgary.ca

Review -- CISC processor instruction phases

- Fetch -- Obtain op-code
    - PC-value out on Address Bus
    - Instruction op-code at Memory[PC] on Data Bus and then into Instruction Register
- Decode -- Bring required values (internal or external) to the ALU input
    - Immediate -- Additional memory access for value -- Memory[PC]
    - Absolute -- Additional memory access for address value and then further access for value -- Memory[Memory[PC]]
    - Indirect -- Additional memory access to obtain value at Memory[AddressReg]
- Execute -- ALU operation
- Writeback -- ALU value to internal/external storage
    - May require additional memory accesses to obtain address used during storage
    - May require additional memory operations to perform storage
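The three operand-fetch cases in the Decode phase can be sketched with a toy memory model. This is only an illustration: the dictionary-based memory, the addresses and the function names are assumptions of the sketch, and the access counts are logical accesses, not 68K bus cycles.

```python
# Toy model: memory is a dict of address -> value (all names are hypothetical).

def fetch_immediate(memory, pc):
    """Operand sits right after the op-code: Memory[PC]. One extra access."""
    return memory[pc], 1

def fetch_absolute(memory, pc):
    """Memory[PC] holds the operand's address: Memory[Memory[PC]]. Two extra accesses."""
    address = memory[pc]
    return memory[address], 2

def fetch_indirect(memory, address_reg):
    """An address register holds the operand's address: Memory[AddressReg]. One extra access."""
    return memory[address_reg], 1

memory = {100: 7, 7: 42, 200: 99}
immediate = fetch_immediate(memory, 100)   # value stored at PC
absolute = fetch_absolute(memory, 100)     # value stored at the address found at PC
indirect = fetch_indirect(memory, 200)     # value stored at the address in the register
```

Each function returns the operand together with the number of additional memory accesses it needed, which is the cost the slide is pointing at.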

[Figure from Clements, Microprocessor Systems Design, PWS Publishing, ISBN 0-534-94822-7]

Data from the memory appears near the end of the read cycle

Access of cosine table -- Common DSP issue

- Need to access the cosine table during FFT, DCT, etc.
    - Calculation of cosines is very time consuming
    - Pre-calculate and store. The only overhead is then the access time
- Simple increment -- cos[q] -- q = 0, k, 2k, 3k, 4k
- Cosine values repeat every N
    - No need to store cosine values from N to 2N - 1
    - If 0 <= q < N then use CosineTable[q];
      else if N <= q < 2N then use CosineTable[q - N];
      else if 2N <= q < 3N then etc.
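The wrap-around test on the table index can be sketched as follows. The table layout (one full period, entry i holding cos(2*pi*i/N)) and the function names are assumptions made for the illustration; the slide only states that the values repeat every N.

```python
import math

def make_cosine_table(n):
    """Pre-calculate one period; assumed layout: table[i] = cos(2*pi*i/n)."""
    return [math.cos(2.0 * math.pi * i / n) for i in range(n)]

def cosine_lookup(table, q):
    """Fold q back into 0 <= q < N before indexing, as in the if/else chain."""
    n = len(table)
    while q >= n:          # N <= q < 2N uses q - N, 2N <= q < 3N uses q - 2N, ...
        q -= n
    return table[q]

N, k = 8, 3
table = make_cosine_table(N)
# q = 0, k, 2k, 3k, 4k
values = [cosine_lookup(table, q) for q in range(0, 5 * k, k)]
```

The comparison and subtraction on every access are exactly the kind of per-iteration overhead that the rest of the lecture tries to remove.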

Similar problems when doing delay lines, FIR or IIR filters

[Figure: buffer with an output pointer and an input pointer advancing together; output pt at A2, A3, A4, A5 while input pt is at A11, A12, A13, A14]

CISC -- Accessing arrays when indexing is not simple

cos[q] -- q = 0, k, 2k, 3k, 4k -- 68K -- 16 bit CISC

Start with

    #define c_address &cos[0]
    MOVEA.L #c_address, A0      FETCH (4) LOAD (2 * 4) STORE (0)

Loop handling

    for (loop = 0; loop < (n - 1) * k; loop += k)
        // put cos value in D0
        MOVE.L (A0), D0         FETCH (4) LOAD (2 * 4) STORE (0)
        ADD.L #(4 * k), A0      FETCH (4) LOAD (2 * 4) STORE (0) ADD32 (4?)
    endfor

Problems

- k is a constant -- unnecessary extra cycles
- Pointer A0 can get beyond end of fixed ROM array -- tackle next lecture
- Code from actual loop itself takes time

Timing required to handle DSP loops

    for k = 0 to (N-1)
        Body of Code -- BofC cycles
    endfor

- Important feature -- how much overhead time is used in handling the loop construct itself?
- Three components
    - Set up time
    - Body of code time -- BofC cycles
    - Handling the loop itself

Basic Loop body

- Set up loop -- loop overhead -- done once
- Check conditions -- loop overhead -- done many times
- Do Code Body -- done many times -- useful
- Loop Back + counter increment -- loop overhead -- done many times

Define

    Loop Efficiency = (N * Tcodebody) / (Tsetup + N * (Tcodebody + Tconditions + Tloopback))

- Different efficiencies depending on size of the loop
- Need to learn good approximation techniques and recognize the two extremes
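The efficiency definition translates directly into a small function, which makes the two extremes easy to see. The cycle counts used below are made-up illustrative values, not figures for any particular processor.

```python
def loop_efficiency(n, t_codebody, t_setup, t_conditions, t_loopback):
    """Useful cycles divided by total cycles, following the definition above."""
    useful = n * t_codebody
    total = t_setup + n * (t_codebody + t_conditions + t_loopback)
    return useful / total

# Hypothetical overheads: 10 cycles of setup, 5 per test, 5 per loop-back.
long_body = loop_efficiency(n=100, t_codebody=1000, t_setup=10, t_conditions=5, t_loopback=5)
short_body = loop_efficiency(n=100, t_codebody=2, t_setup=10, t_conditions=5, t_loopback=5)
```

When the body dominates, the efficiency approaches 1 and the loop overhead can be ignored. When the body is tiny, the efficiency is roughly Tcodebody / (Tcodebody + Tconditions + Tloopback), so the overhead decides the run time.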

3 basic loop constructs

- While loop
    - Main compare test at top of loop
- Modified do-while loop with initial test
    - Initial compare test at top
    - Main compare test at the bottom of the loop
- Down-counting do-while loop with initial test
    - No compare operations in test. Relies on the setting of condition code flags during adjustment of the loop counter
    - Can increase overhead in some algorithms
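The three constructs can be written out in a high-level form to check that they run the body the same number of times. Python has no do-while, so the bottom test is emulated with an explicit break; the function names are hypothetical, and the `count == 0` test stands in for a branch on the flags set by the decrement.

```python
def while_loop(n, body):
    count = 0
    while count < n:              # main compare test at the top
        body(count)
        count += 1

def do_while_with_initial_test(n, body):
    count = 0
    if count < n:                 # initial compare test, done once
        while True:
            body(count)
            count += 1
            if not (count < n):   # main compare test at the bottom
                break

def down_counting(n, body):
    count = n
    if count > 0:                 # initial test
        while True:
            body(n - count)       # extra work if the body needs an up-counting index
            count -= 1
            if count == 0:        # no compare: flags come from the decrement
                break

def run(loop, n):
    seen = []
    loop(n, seen.append)
    return seen
```

The `n - count` in the down-counting version is one example of how that form can add overhead when the algorithm needs an ascending index.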

Basic 68K CISC loop -- Test at start

          MOVE.L #0, count    -- Set up -- count in register
                              -- Fetch instr. (FI4) + Fetch 32-bit constant (FC 2 * 4) + operation (OP0)
    LOOP: CMP.L #N, count     -- (FI4, FC8, OP4 -- 32-bit subtract)
          BGE somewhere       -- Actually ADD.L #(somewhere - 4), PC
                              -- (add of 16-bit displacement to PC -- FI4, FC4, OP(0 or 4))
          Body Cycles
          ADD.L #1, count     -- (FI4, FC8, OP4)
          JMP LOOP            -- (FI4, FC8, OP4)

    Loop Efficiency = (N * BodyCycles) / (12 + N * (28 + BodyCycles + 32))

Since 60 >> 12 (5 times) then ignore startup cycles even if N small

Check at end -- 68K CISC loop

              MOVE.L #0, count    -- (FI4, FC8, OP0)
              JMP LOOPTEST        -- (FI4, FC8, OP4)
    LOOP:     Body Cycles
              ADD.L #1, count     -- (FI4, FC8, OP4)
    LOOPTEST: CMP.L #N, count     -- (FI4, FC8, OP4)
              BLT LOOP            -- (FI4, FC4, OP4)

    Loop Efficiency = (N * BodyCycles) / (28 + N * BodyCycles + 44 * (N + 1))

Since 44 is only about 1.6 times the 28 startup cycles (12 + 16), can't ignore startup cycles if N small and Body Cycles small -- small loop means inefficient

Down Count -- 68K CISC loop

              MOVEQ.L #0, array_index   -- (FI4, FC0, OP0)
              MOVE.L #N, count          -- (FI4, FC0, OP0)
              JMP LOOPTEST              -- (FI4, FC8, OP4)
    LOOP:     BodyCycles using instructions of form OPERATION (Addreg, Index)
              ADDQ.L #1, array_index    -- (FI4, FC0, OP0?)
              SUBQ.L #1, count          -- (FI4, FC0, OP0?)
    LOOPTEST: BGT LOOP                  -- (FI4, FC4, OP4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))

Since 20 < 24 then can't ignore startup if N small and Body Cycles small

Down Count -- Possible sometimes

              MOVEA.L #array_start, Addreg   -- (FI4, FC0, OP0)
              MOVE.L #N, count               -- (FI4, FC0, OP0)
              JMP LOOPTEST                   -- (FI4, FC8, OP4)
    LOOP:     BodyCycles using autoincrement mode OPCODE (Addreg)+
              SUBQ.L #1, count               -- (FI4, FC0, OP0?)
    LOOPTEST: BGT LOOP                       -- (FI4, FC4, OP4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))

Since 16 < 24 then can't ignore startup if N small and Body Cycles small

NOTE -- Number of cycles needed in body of the loop decreases in this case

Loop Efficiency on CISC processor

- Efficiency depends on how loop constructed
    - Standard while-loop
    - Check at end -- modified do-while
    - Down counting -- with/without auto-incrementing addressing modes
- Compiler versus hardcode efficiency
    - See Embedded System Design magazine Sept./Oct. 2000
- What happens with different processor architectures?
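To compare the four 68K loop forms side by side, the per-instruction annotations (FI, FC, OP) from the listings above can be summed and fed into the efficiency formulas. This is a rough sketch: the helper names are hypothetical, the body length of 20 cycles is an arbitrary choice, and the body is kept the same for all forms even though the autoincrement form would also shorten it.

```python
def cycles(*instructions):
    """Sum fetch-instruction, fetch-constant and operation cycles."""
    return sum(fi + fc + op for fi, fc, op in instructions)

def efficiency(n, body, setup, overhead, passes):
    """Useful body cycles over total cycles; `passes` is N or N + 1."""
    return (n * body) / (setup + n * body + overhead * passes)

def compare_68k_loops(n, body):
    move_l = (4, 8, 0)     # MOVE.L #0, count with a 32-bit constant
    short = (4, 0, 0)      # MOVEQ.L, ADDQ.L, SUBQ.L and the FC0 moves as annotated
    jmp = (4, 8, 4)        # JMP
    cmp_l = (4, 8, 4)      # CMP.L #N, count
    add_l = (4, 8, 4)      # ADD.L #1, count
    branch = (4, 4, 4)     # BGE / BLT / BGT, taking the 4-cycle operation case
    return {
        "test at start": efficiency(
            n, body, cycles(move_l), cycles(cmp_l, branch, add_l, jmp), n),
        "check at end": efficiency(
            n, body, cycles(move_l, jmp), cycles(add_l, cmp_l, branch), n + 1),
        "down count": efficiency(
            n, body, cycles(short, short, jmp), cycles(short, short, branch), n + 1),
        "down count, autoincrement": efficiency(
            n, body, cycles(short, short, jmp), cycles(short, branch), n + 1),
    }

results = compare_68k_loops(n=100, body=20)
```

With these inputs the efficiency climbs from roughly a quarter for the test-at-start loop to a little over one half for the autoincrement down-count loop.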

Check at end -- 29K RISC loop

              CONST count, 0             -- (1 cycle)
              JUMP LOOPTEST              -- (1 cycle)
              NOP                        -- (1 cycle -- delay slot)
    LOOP:     Bodycycles                 -- autoincrementing mode -- NOT AN OPTION
              ADDU count, count, 1       -- (1 cycle)
    LOOPTEST: CPLT TruthReg, count, N    -- (1 cycle, should be 2 -- register forwarding)
                                         -- (Boolean Truth Flag in TruthReg -- which could be any register)
              JMPT TruthReg, LOOP        -- (1 cycle)
              NOP                        -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))

Since 4 is about the same size as 3, can't ignore startup if N small and Body Cycles small

Since dealing with single cycle operations -- body cycle count smaller

Down Count -- 29K RISC loop

              CONST index, 0             -- 1 cycle
              JUMP LOOPTEST              -- 1 cycle
              CONST count, N             -- in delay slot
    LOOP:     BodyCycles
              SUBU count, count, 1       -- 1 cycle
    LOOPTEST: CPGT TruthReg, count, 0    -- 1 cycle
              JMPT TruthReg, LOOP        -- 1 cycle
              ADDS index, index, 1       -- in delay slot

    Loop Efficiency = (N * BodyCycles) / (3 + N * BodyCycles + 4 * (N + 1))

Efficiency on RISC processors

- Not much difference between Test at end and Down count loop formats
- HOWEVER body-cycle count has decreased
- Processor is highly pipelined -- Loop jumps cause the pipeline to stall
    - Need to take advantage of delay slots
- Efficiency depends on DSP algorithm being implemented?
- What about DSP processors?
    - Architecture is designed for efficiency in this area
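A lower efficiency figure does not mean a slower loop, because the body itself shrinks on the RISC processor. The sketch below contrasts the 68K down-count loop (24 setup, 20 per pass) with the 29K loops (3 setup, 4 per pass). The figure of 8 cycles per 68K body instruction is an assumption chosen for illustration, as is the body length of 5 instructions.

```python
def total_cycles(n, body_cycles, setup, overhead):
    """Total cycles using the setup + N*body + overhead*(N+1) form of the slides."""
    return setup + n * body_cycles + overhead * (n + 1)

def compare_cisc_risc(n, body_instructions, cisc_cycles_per_instruction=8):
    cisc_body = body_instructions * cisc_cycles_per_instruction   # assumed average
    risc_body = body_instructions * 1                             # single-cycle instructions
    cisc_total = total_cycles(n, cisc_body, setup=24, overhead=20)
    risc_total = total_cycles(n, risc_body, setup=3, overhead=4)
    return {
        "cisc_total": cisc_total,
        "risc_total": risc_total,
        "cisc_efficiency": n * cisc_body / cisc_total,
        "risc_efficiency": n * risc_body / risc_total,
    }

result = compare_cisc_risc(n=100, body_instructions=5)
```

Under these assumptions the RISC loop has the lower efficiency ratio yet needs far fewer cycles in total, which is why efficiency should be read alongside the absolute cycle count.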

Check at end -- ADSP-21K loop

              count = 0;                 -- (1 cycle)
              number = N;                -- (1 cycle)
              JUMP LOOPTEST (DB);        -- (1 cycle)
              NOP;                       -- (1 cycle -- delay slot)
              NOP;                       -- (1 cycle -- delay slot)
    LOOP:     BODYCYCLES
              count = count + 1;         -- (1 cycle)
    LOOPTEST: Comp(count, number);       -- (1 cycle)
              IF LT JUMP LOOP (DB);      -- (1 cycle)
              NOP;                       -- (1 cycle -- delay slot)
              NOP;                       -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 5 * (N + 1))

Speed improve -- Possible?

              count = 1;                 -- (1 cycle)
              number = N;                -- (1 cycle)
              JUMP LOOPTEST (DB);        -- (1 cycle)
              count = count - 1;         -- (1 cycle -- delay slot)
              number = number - 1;       -- (1 cycle -- delay slot)
    LOOP:     BODYCYCLES
              count = count + 1;         -- (1 cycle)
    LOOPTEST: Comp(count, number);       -- (1 cycle)
              IF LT JUMP LOOP (DB);      -- (1 cycle)
              count = count + 1;         -- (1 cycle -- delay slot)
              NOP;                       -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (5 + N * BodyCycles + 4 * (N + 1))

Down Count -- ADSP-21K loop

              number = 0;                    -- (1 cycle)
              JUMP (PC, LOOPTEST) (DB);      -- (1 cycle)
              index = 0;                     -- (1 cycle -- in delay slot)
              count = N;                     -- (1 cycle -- in delay slot)
    LOOP:     Bodycycles
              count = count - 1;             -- (1 cycle)
    LOOPTEST: Comp(count, number);           -- (1 cycle)
              IF GT JUMP (PC, LOOP) (DB);    -- (1 cycle)
              index = index + 1;             -- (1 cycle -- delay slot)
              NOP;                           -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 5 * (N + 1))

Improved Down Count -- ADSP-21K loop

Is code valid -- or 1 off in times around loop?

              number = -1;                   -- Bias the loop counter (1 cycle)
              JUMP (PC, LOOPTEST) (DB);      -- (1 cycle)
              index = 0;                     -- (1 cycle -- in delay slot)
              count = (N-1);                 -- (1 cycle -- in delay slot)
    LOOP:     Body cycles
    LOOPTEST: Comp(count, number);           -- (1 cycle)
              IF GT JUMP (PC, LOOP) (DB);    -- (1 cycle)
              index = index + 1;             -- (1 cycle -- delay slot)
              count = count - 1;             -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))
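One way to probe the "1 off?" question is to trace the listing in a short simulation. The sketch below is a model, not processor documentation: it assumes that both delay-slot instructions always execute, whether or not the conditional jump is taken, and the function name is hypothetical.

```python
def improved_down_count(n):
    """Trace the biased down-count loop; return (body passes, index seen by each pass)."""
    number = -1            # biased limit
    index = 0              # set in the delay slot of the initial jump
    count = n - 1          # set in the delay slot of the initial jump
    seen_indices = []
    while True:
        taken = count > number      # Comp(count, number); IF GT JUMP
        index += 1                  # delay slot, assumed to execute either way
        count -= 1                  # delay slot, assumed to execute either way
        if not taken:
            break
        seen_indices.append(index)  # the loop body runs here
    return len(seen_indices), seen_indices

passes, indices = improved_down_count(3)
```

Under that assumption the body runs N times, so the pass count is right, but the body sees index values 1 to N because the increment in the delay slot has already happened before the first pass.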

Faster Loops

Need to go to special features

- CISC -- special Test, Conditional Jump and Decrement in 1 instruction
- RISC -- Change algorithm format
- DSP -- Special hardware for loops
    - Maximum of six nested loops
    - Can be a hidden trap when writing "C"

Recap -- 68K CISC loop down count

              MOVEQ.L #0, index     -- (FI4, FC0, OP0)
              MOVE.L #N, count      -- (FI4, FC0, OP0)
              JMP LOOPTEST          -- (FI4, FC8, OP4)
    LOOP:     BodyCycles
              ADDQ.L #1, index      -- (FI4, FC0, OP0?)
              SUBQ.L #1, count      -- (FI4, FC0, OP0?)
    LOOPTEST: BGT LOOP              -- (FI4, FC4, OP4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 20 * (N + 1))

Since 24 is about the same size as 20, can't ignore startup if N small and Body Cycles small

Hardware 68K CISC loop

              MOVEQ.L #0, index       -- (FI4, FC0, OP0)
              MOVE.L #(N-1), count    -- (FI4, FC0, OP0)
              JMP LOOPTEST            -- (FI4, FC8, OP4)
    LOOP:     BodyCycles
              ADDQ.L #1, index        -- (FI4, FC0, OP0?)
    LOOPTEST: DBCC count, LOOP        -- (FI4, FC4, OP4)

    Loop Efficiency = (N * BodyCycles) / (24 + N * BodyCycles + 16 * (N + 1))

Possibility that Efficiency is almost 100% if the Body Instructions are small enough to fit into cache

Custom loop hardware on RISC

- For long loops -- loop overhead small -- no need to be concerned about the loop overhead (unless loop in loop)
- For small loops -- unroll the loop so that there are 20 hardcoded instructions rather than 1 instruction looped 20 times
- For medium loops -- advantage over CISC normally is that instructions are more efficient -- 1 cycle compared to 4 to 8 cycles
- For medium loops -- advantage over DSP normally is that instructions are more efficient -- 1 RISC cycle compared to 2 DSP cycles (not 21K since 1 to 1)
- For more information
    - See the Micro 1992 articles
    - See the CCI articles
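The small-loop advice can be illustrated with a simple cycle model and a hand-unrolled sum. Everything here is a sketch: the function names are hypothetical, the 4-cycle overhead is borrowed from the 29K loop listings, and leftover iterations are assumed to run one at a time, each paying the full overhead.

```python
def rolled_cycles(n, body, overhead):
    """Every pass pays the body plus the loop overhead."""
    return n * (body + overhead)

def unrolled_cycles(n, body, overhead, factor):
    """Overhead is paid once per group of `factor` bodies, plus once per leftover."""
    passes, leftover = divmod(n, factor)
    return n * body + (passes + leftover) * overhead

def sum_rolled(data):
    total = 0
    for value in data:
        total += value
    return total

def sum_unrolled_by_4(data):
    total = 0
    i = 0
    n = len(data)
    while i + 4 <= n:                       # one loop test per four additions
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    while i < n:                            # leftover elements
        total += data[i]
        i += 1
    return total

data = list(range(1, 23))
rolled = rolled_cycles(20, body=1, overhead=4)
fully_unrolled = unrolled_cycles(20, body=1, overhead=4, factor=20)
partly_unrolled = unrolled_cycles(20, body=1, overhead=4, factor=4)
```

In this model a one-instruction body looped 20 times costs 100 cycles, while unrolling by 4 brings it down to 40 and unrolling completely to 24, at the price of larger code.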

21k Processor architecture

DAG generator architecture

- Animation from SHARCNavigator on diskette from office
- Also see DSP workshop book and exercises

Recap -- Improved Down Count -- 21K DSP loop

              number = -1                   -- (1 cycle)
              JUMP (PC, LOOPTEST) (DB)      -- (1 cycle)
              index = 0                     -- (1 cycle -- in delay slot)
              count = (N-1)                 -- (1 cycle -- in delay slot)
    LOOP:     Body cycles
    LOOPTEST: Comp(count, number)           -- (1 cycle)
              IF GT JUMP (PC, LOOP) (DB)    -- (1 cycle)
              index = index + 1             -- (1 cycle -- delay slot)
              count = count - 1             -- (1 cycle -- delay slot)

    Loop Efficiency = (N * BodyCycles) / (4 + N * BodyCycles + 4 * (N + 1))

Hardware Loop -- 21K DSP loop

                   count = N                             -- (1 cycle)
                   count = PASS count                    -- (1 cycle)
                   IF LE JUMP (PC, PASTLOOP) (DB)        -- (1 cycle)
                   index = 0                             -- (1 cycle -- in delay slot)
                   NOP                                   -- (1 cycle -- in delay slot)
    HARDWARE_LOOP: LCNTR = N, DO PASTLOOP-1 UNTIL LCE    -- (1 cycle -- parallel instruction)
                   Body-cycles                           -- last cycle of loop is at location PASTLOOP-1
    PASTLOOP:      Rest of the program code

    Loop Efficiency = (N * BodyCycles) / (6 + N * BodyCycles)

DSP Hardware loop

- Efficiency from a number of areas
    - Hardware counter
        - No overhead for decrement
        - No overhead for compare
    - Pipelining efficient
        - Processor knows to fetch instructions from start of loop, not past the loop
- Has some problems if loop size is too small -- loop timing is longer than expected as processor needs to flush the pipeline and restart it
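The gain from the hardware loop shows up clearly when the 21K efficiency formulas are evaluated for a short body. The sketch below uses the setup and per-pass cycle counts quoted in the 21K listings above. The function names and the choice of N = 100 with a 2-cycle body are illustrative assumptions, and the model ignores the extra pipeline cost for very small loops mentioned on this slide.

```python
def software_loop_efficiency(n, body, setup, overhead):
    """Software loop: the test and jump cost `overhead` cycles on each of N + 1 passes."""
    return (n * body) / (setup + n * body + overhead * (n + 1))

def hardware_loop_efficiency(n, body, setup=6):
    """Hardware loop: only the setup cycles are overhead."""
    return (n * body) / (setup + n * body)

n, body = 100, 2
check_at_end = software_loop_efficiency(n, body, setup=5, overhead=5)
improved_down_count = software_loop_efficiency(n, body, setup=4, overhead=4)
hardware = hardware_loop_efficiency(n, body)
```

For this two-cycle body the software loops stay near 30% efficiency, while the hardware loop is above 97%, since decrement, compare and jump no longer cost instruction cycles.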

Tackled today

- Performing access to memory in a loop
    - Loop overhead can steal many cycles
- Loop overhead -- depends on implementation
    - Standard loop with test at the start -- while ( )
    - Initial test with additional test at end -- do-while ( )
    - Down-counting loops
- Special efficiencies
    - CISC -- hardware
    - RISC -- intelligent compilers
    - DSP -- hardware