Presentation is loading. Please wait.

Presentation is loading. Please wait.

ECSE 436 1 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear.

Similar presentations


Presentation on theme: "ECSE 436 1 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear."— Presentation transcript:

1 ECSE 436 1 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

2 ECSE 436 2 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

3 ECSE 436 3 Instruction Set Architecture (ISA) Computers run programs made of simple operations called “instructions” The list of instructions offered by the machine is the “instruction set” The instruction set is what is visible to the programmer (really the compiler, although humans can directly program in “assembly language”) Many different DSPs can share the same ISA but have different hardware (i.e. the implementation of the ISA is different)

4 ECSE 436 4 Instructions Two kinds of information in a computer: instructions data Instructions are stored as numbers, just like data Instructions and data are stored in the memory

5 ECSE 436 5 Basic Computer Organization CPU registers memory storeload PCIR OPCODE OPERANDS Limited number of fast registers for temporary storage Large amount of slow memory Arranged as an array of bytes Instructions are loaded into an Instruction register (IR) from the address pointed to by the program counter (PC). The PC is incremented by the instruction size (in bytes) for each new instruction. E.g. PC  PC + 4

6 ECSE 436 6 Load/Store Architecture (Reg-Reg) CPU registers memory storeload PCIR Instructions can ONLY get their data and store their data from/to registers. The register numbers are specified in the operand fields of the instruction Since data is stored in memory, we need special “load” and “store” instructions for transfers between registers and memory. These two instructions are the ONLY ones allowed to access memory

7 ECSE 436 7 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

8 ECSE 436 8 C6000 Architecture TMS320C62x/C64x 16-bit fixed point DSP TMS320C67x 32-bit floating point DSP Instuction set is a superset of the C62x VLIW Architecture Very Long Instruction Word

9 ECSE 436 9 VLIW VLIW is an architecture that exploits instruction level parallelism (ILP) in the code What is ILP? An instruction is dependent on another if it uses (produces) a value produced (used) by the other instruction

10 ECSE 436 10 Example add c,d,e mult b,e,a The mult instruction must wait for the add instruction to finish before it can execute (sequential data flow) e

11 ECSE 436 11 Example add a,b,e add c,d,f add e,f,g The first two adds have no data dependency and could even be switched in the code with no effect on the correctness of the answer The first two adds could be executed in parallel if we had the hardware to do it (two adders) + + + ab c d e f g

12 ECSE 436 12 Scheduling Given a set of hardware resources (functional units), e.g. a number of adders, multipliers, etc…, the process of determining which instructions can be executed in parallel and which functional units to use on any given clock cycle is called instruction scheduling

13 ECSE 436 13 VLIW VLIW is an architecture that depends on the user (compiler) to do the scheduling Instructions are packed into a very long instruction word (256 bits) There is no scheduling hardware on the chip like on a Pentium 4 which uses hardware, or dynamic scheduling Benefits simple hardware Drawbacks requires sophisticated compilers code compatibility – need to recompile if you use a different DSP, even one with the same ISA

14 ECSE 436 14 C6713 Architecture

15 ECSE 436 15 Maximum Performance C6713 8 functional units, two MACS per cycle 225 MHz 1800 MIPS 6 of the 8 units floating point 225 MHz 1350 MFLOPS

16 ECSE 436 16 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

17 ECSE 436 17 Addressing Modes Load/Store must load registers from memory, process data, store back to memory Linear (indirect addressing) 32 registers A0-A15, B0-B15 can act as pointers *R register R contains the address of memory location where a data value is stored

18 ECSE 436 18 Linear Addressing *R++(d) R contains the address. After R is used, postincrement by discplacement d (default is d = 1), -- post decrements *++R(d) preincrement or predecrement *+R(d) preincrement without modification

19 ECSE 436 19 Circular Addressing

20 ECSE 436 20 Circular Addressing Address Mode Register (AMR)

21 ECSE 436 21 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

22 ECSE 436 22 TMS320 Assemby Language [label][:] mnemonic [operand list] [; comment] [x] means that x is optional label symbolic name for the address of the program line mnemonic instruction, assembler directive, macro cannot start in column 1 operands constants: binary (e.g. 010101b), decimal, hexdecimal (e.g. 0x9f or 9fh) register names symbols defined by assembler directives

23 ECSE 436 23 Assembler Directives The assembler produces COFF (common- obect file format) files COFF files are divided into sections that contain instructions or data Assembler directives are instructions to the assembler on how to manipulate these sections or to define constants they are not machine instructions see Section 4.1 in the text for more details

24 ECSE 436 24 C6000 ISA parallel conditional execution functional unit

25 ECSE 436 25 Instruction Packing Instruction 1 ; instructions 1 and 2 Instruction 2; are executed sequentially Instruction 3; instructions 3, 4, and 5 || Instruction 4; are executed in parallel || Instruction 5 VELOCITI: 1 to 8 execute packets in a fetch packet

26 ECSE 436 26 Sample Instructions ADD.L1A3,A7,A7;add A3+A7->A7 SUB.S1A1,1,A1;subtract 1 from A1 MPY.M2A7,B7,B6 ; mult 16LSBs of A7,B7->B6 || MPYH.M1A7,B7,A6; mult 16MSBs of A7,B7->A6 LDH.D2*B2++,B7; load (B2) -> B7, inc B2 ||LDH.D1*A2++,A7; load (A2) -> A7, inc A2

27 ECSE 436 27 Sample Instructions LoopMVKL.S1x,A4; move 16 LSBs of x addr->A4 MVKH.S2x,A4; move 16 MSBs of x addr->A4 SUB.S1A1,1,A1; decrement A1 [A1]B.S2Loop; branch to Loop if A1 != 0 NOP5; 5 NOP instructions STW.D1A3, *A7; store A3 into (A7)

28 ECSE 436 28 Linear Assembly To effectively program a DSP using assembly language, you need to do the scheduling by hand! Need to account for the number of clock cycles each functional unit takes, etc… Difficult, so TI has linear assembly you don’t have to schedule it, the compiler does it for you can use CPU resources without worrying about scheduling, register allocation, etc…

29 ECSE 436 29 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear assembly Pipelining

30 ECSE 436 30 Pipelining Key technique to make fast CPUs Multiple instructions are overlapped in execution E.g. Automotive assembly line

31 ECSE 436 31 body (B) 1 hour paint (P) 1 hour Wheels (W) 1 hour Pipelining: principle

32 ECSE 436 32 BobTime (h) 1 2 3 4 5 6 B1B1 0 P1P1 W1W1 B2B2 P2P2 W2W2 2 cars / 6 hours  1/3 car / hour Pipelining: principle(II)

33 ECSE 436 33 Bob Time (h) 1 2 3 4 5 6 B1B1 0 P1P1 W1W1 B2B2 P2P2 W2W2 Alice Bill B3B3 B4B4 B5B5 B6B6 P3P3 P4P4 P5P5 W3W3 W4W4 1 car / hour (3 x speedup) Pipelining: principle(III)

34 ECSE 436 34 COMB. LOGIC cycle time Pipelining: principle(IV)

35 ECSE 436 35 Performance Gain Pipelining a datapath m times can result in up to m times improvement in cycle time E.g. 5-stage pipelined processor is potentially 5 times faster than an unpipelined processor In reality, this is limited to less than m because of restrictions in overlapping instructions

36 ECSE 436 36 5-Stage RISC Pipeline

37 ECSE 436 37 16-Stage C6713 Pipeline Fetch (4 stages) calc. address, send address, wait, receive Decode (2 stages) separate fetch packets into execute packets Execute (10 stages) Different instructions require different number of cycles to execute

38 38 Software and I/O

39 ECSE 436 39 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

40 ECSE 436 40 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

41 ECSE 436 41 Code Efficiency Intrinsic functions e.g. _add2, _mpy, sadd see TMS320C62x/C67x Programmers Guide Packed data use word access to operate on 16-bit data store in the high and low parts of a 32-bit register

42 ECSE 436 42 Loop Unrolling A loop is a compact way of representing a repetitive sequence of instructions, but… The loop condition test is overhead To remove the loop overhead, unroll the loop (make copies of the loop code) key way of exposing parallelism !!! The compiler can now look across loop iterations to find parallel instructions parallelism increased, but so is code size

43 ECSE 436 43 Example ; program A: code without unrolling MVK4,B0 loop: LDH*A5++,A0 ||LDH*A6++,A1 ADDA0,A1,A2 ;add 4 times … SUBB0,1,B0 [B0]Bloop

44 ECSE 436 44 Example ; program B: code with unrolling once MVK2,B0 loop: LDH*A5++,A0 ||LDH*A6++,A1 ; add first 2 numbers ADDA0,A1,A2 … LDH*A5++,A0 ||LDH*A6++,A1 ; add other 2 numbers ADDA0,A1,A2 … SUBB0,1,B0 [B0]Bloop

45 ECSE 436 45 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

46 ECSE 436 46 Software Pipelining Software pipelining compiler technique (don’t confuse with h/w pipelining) Schedule multiple iterations of a loop together to fill any empty cycles and maximize functional unit usage -O2 –O3

47 ECSE 436 47 Software Pipelining The general idea of this optimization is to uncover long sequences of statements without branch statements Reorganize loops to interleave instructions from different iterations Dependent instructions within a single loop iteration are then separated from one another by an entire loop body Increases possibilities of scheduling

48 ECSE 436 48 Software Pipelining

49 ECSE 436 49 Software Pipelining Advantage: yields shorter code than loop unrolling and uses fewer registers Software pipelining is crucial for VLIW processors Often, both software pipelining and loop unrolling are used

50 ECSE 436 50 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

51 ECSE 436 51 Interrupts A signal that causes the processor to suspend its current program and execute a special subroutine interrupt service routine (ISR) Sources On-chip peripherals timers, serial ports External resets, external peripherals Software interrupts arithmetic exceptions (divide by zero, overflow)

52 ECSE 436 52 Interrupts

53 ECSE 436 53 Interrupts

54 ECSE 436 54 Interrupts

55 ECSE 436 55 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

56 ECSE 436 56 Direct Memory Access Data transfer without intervention of processsor memory and CPU peripherals and CPU DMA channel: source address destination address element count in a frame number of frames in a block

57 ECSE 436 57 Software and I/O Code efficiency and programming techniques Loop unrolling Software pipelining I/O considerations Interrupts DMA Block processing

58 ECSE 436 58 Block Processing

59 ECSE 436 59 Ping-Pong Buffering Ping-pong buffer (double buffer) DMA channel delivers N samples of data in and out of buffers while the DSP operates on data in the current buffer Next block, roles of the buffers are changed


Download ppt "ECSE 436 1 DSP architecture Review of basic computer architecture concepts C6000 architecture: VLIW Principle and Scheduling Addressing Assembly and linear."

Similar presentations


Ads by Google