Published by Μαργαρίτες Ζωγράφου. Modified over 6 years ago.
Subject Name: Digital Signal Processing Algorithms & Architecture
Subject Code: 10EC751
Prepared By: S. Shikky Marice, Prashanth, Shivlila
Department: Electronics and Communication Engineering
Date: 11/14/2018
Unit-02: Architectures for Programmable Digital Signal Processing Devices
Basic Architectural Features
A programmable DSP device should provide instructions similar to those of a conventional microprocessor. The instruction set of a typical DSP device should include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY, etc.
b. Logical operations such as AND, OR, NOT, XOR, etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)
DSP Computational Building Blocks
Multipliers
The advent of single-chip multipliers paved the way for implementing DSP functions on a VLSI chip. Parallel multipliers have now replaced the traditional shift-and-add multipliers. A parallel multiplier takes a single processor cycle to fetch and execute the instruction and to store the result. Parallel multipliers are also called array multipliers. The key features to consider for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
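Parallel multipliers:
Consider the multiplication of two unsigned numbers A and B. Let A be represented using m bits as (A_{m-1} A_{m-2} ... A_1 A_0) and B be represented using n bits as (B_{n-1} B_{n-2} ... B_1 B_0). Then the product of these two numbers is given by
$$P = A \times B = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} A_i B_j \, 2^{i+j}$$
A parallel array that forms and adds all of the partial products A_i B_j in hardware is known as the Braun multiplier.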
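The product of two unsigned numbers can be viewed as a sum of one-bit partial products, each weighted by a power of two. The following is a minimal Python sketch of that idea, not a model of any particular multiplier circuit; the function name `array_multiply` and its parameters are hypothetical and chosen only for illustration.

```python
def array_multiply(a: int, b: int, m: int, n: int) -> int:
    """Multiply an unsigned m-bit value a by an unsigned n-bit value b
    by summing the bit-level partial products A_i * B_j * 2^(i+j)."""
    if not (0 <= a < (1 << m) and 0 <= b < (1 << n)):
        raise ValueError("operand does not fit in the stated word length")
    product = 0
    for i in range(m):
        a_i = (a >> i) & 1          # bit i of A
        for j in range(n):
            b_j = (b >> j) & 1      # bit j of B
            # One AND gate per (i, j) pair; its output has weight 2^(i+j).
            product += (a_i & b_j) << (i + j)
    return product


print(array_multiply(13, 11, 4, 4))
```

A hardware array evaluates all m × n AND terms at once and adds them with a grid of adders; the nested loop here only mirrors the arithmetic, not the timing.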
Multipliers for signed numbers:
Consider two signed numbers A and B,
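As a hedged illustration, assume both numbers are n bits long and use two's complement representation (an assumption; the text does not state the representation). Under that assumption the operands and their product expand as follows:

```latex
A = -A_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} A_i\,2^{i}, \qquad
B = -B_{n-1}\,2^{n-1} + \sum_{j=0}^{n-2} B_j\,2^{j}

P = A \times B
  = A_{n-1}B_{n-1}\,2^{2n-2}
  + \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} A_i B_j\,2^{i+j}
  - A_{n-1}\,2^{n-1}\sum_{j=0}^{n-2} B_j\,2^{j}
  - B_{n-1}\,2^{n-1}\sum_{i=0}^{n-2} A_i\,2^{i}
```

Unlike the unsigned case, two of the four terms are subtracted, so a signed array multiplier needs extra logic for the sign bits. The Baugh-Wooley multiplier is one well-known structure built for this case.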
Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can be at most 2n bits long. To perform the whole operation in a single execution cycle, we require two buses of width n bits each to fetch the operands X and Y, and a bus of width 2n bits to store the result Z to memory. Although this performs the operation faster, it is expensive to implement. There are two alternatives:
a. Use the n-bit operand bus and save Z at two successive memory locations. This stores the exact value of Z in memory, but it takes two cycles to store the result.
b. Discard the lower n bits of the result Z and store only the higher-order n bits into memory. This is not suitable for applications that require an accurate result.
Another alternative can be used for applications where speed is not a major concern: latches are used at the inputs and outputs, so that a single bus suffices to fetch the operands and to store the result.
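The trade-off between storing the full 2n-bit product in two words and keeping only the upper n bits can be sketched in a few lines of Python. This is only an arithmetic illustration under the assumption of unsigned operands; the helper names `split_product` and `reconstruct` are hypothetical.

```python
def split_product(x: int, y: int, n: int):
    """Return (high, low): the two n-bit words of the 2n-bit product
    of two unsigned n-bit operands."""
    mask = (1 << n) - 1
    z = (x & mask) * (y & mask)
    return z >> n, z & mask


def reconstruct(high: int, low: int, n: int) -> int:
    """Alternative (a): both words were stored, so Z is recovered exactly."""
    return (high << n) | low


high, low = split_product(0xFFFF, 0xFFFF, 16)
exact = reconstruct(high, low, 16)   # two store cycles, no loss
truncated = high << 16               # alternative (b): one store cycle, low word lost
print(hex(exact), hex(truncated), exact - truncated)
```

The difference between `exact` and `truncated` is the discarded lower word, which is the accuracy given up in exchange for a single store cycle.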
Shifters
Shifters are used to scale operands or results up or down. The following scenarios show why a shifter is needed:
a. When adding N numbers, each n bits long, the sum can grow up to n + log2 N bits. If the accumulator is only n bits long, an overflow error will occur. This can be overcome by using a shifter to scale down each operand by log2 N bits.
b. Similarly, when calculating the product of two n-bit numbers, the product can grow up to 2n bits. Generally the lower n bits are discarded and the sign bit is shifted to preserve the sign of the product.
c. Finally, when adding two floating-point numbers, one of the operands has to be shifted appropriately to make the exponents of the two numbers equal.
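Scenario (a) can be illustrated with a short Python sketch, assuming unsigned samples and a shift of ceil(log2 N) bits applied to each operand before accumulation. The function name `scaled_sum` is hypothetical.

```python
def scaled_sum(samples, n_bits):
    """Accumulate unsigned n-bit samples in an n-bit accumulator after
    scaling each one down by ceil(log2 N) bits, N = number of samples."""
    shift = (len(samples) - 1).bit_length()   # ceil(log2 N) for N >= 1
    limit = (1 << n_bits) - 1                 # largest value the accumulator holds
    acc = 0
    for s in samples:
        acc += s >> shift                     # shifter scales the operand down
        if acc > limit:
            raise OverflowError("accumulator overflow")
    return acc, shift


# Worst case: 64 samples, all at the 16-bit maximum.
print(scaled_sum([0xFFFF] * 64, 16))
```

The scaled sum fits in 16 bits even in the worst case, at the cost of discarding the lowest 6 bits of every sample.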
Barrel Shifters
For an input of length n, log2 n control lines are required. An additional control line is required to indicate the direction of the shift.
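One way to see why log2 n control lines suffice is to build the shift from log2 n stages, where control line k enables a shift by 2^k positions. The sketch below assumes a left-only shifter and a power-of-two word length; `barrel_shift_left` is a hypothetical name, and the loop models the data path only, not the single-cycle timing of real hardware.

```python
def barrel_shift_left(value: int, amount: int, n: int = 16) -> int:
    """Left-shift an n-bit word by `amount` using log2(n) stages.
    Bit k of `amount` acts as control line k and enables a shift by 2^k."""
    if not 0 <= amount < n:
        raise ValueError("shift amount out of range")
    stages = (n - 1).bit_length()      # log2(n) when n is a power of two
    mask = (1 << n) - 1                # keep the result n bits wide
    for k in range(stages):
        if (amount >> k) & 1:
            value = (value << (1 << k)) & mask
    return value


print(hex(barrel_shift_left(0x00FF, 4)))
```

For n = 16 the loop has four stages (shift by 1, 2, 4 and 8), so any shift from 0 to 15 is selected by a 4-bit control word.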
Problem: A barrel shifter is to be designed with 16 inputs for left shifts from 0 to 15 bits. How many control lines are required to implement the shifter?
Solution: As 16 bits are used to represent the input, log2 16 = 4 control inputs are required.
Problem: It is required to find the sum of 64 16-bit numbers. How many bits should the accumulator have so that the sum can be computed without overflow error or loss of accuracy?
Solution: The sum of 64 16-bit numbers can grow up to (16 + log2 64) = 22 bits. Hence the accumulator should be 22 bits long to avoid overflow.
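Both answers can be checked numerically. The following Python sketch uses two hypothetical helper functions, `control_lines` and `accumulator_bits`, and assumes unsigned operands for the worst-case check.

```python
def control_lines(n_inputs: int) -> int:
    """Control lines needed to select any shift from 0 to n_inputs - 1."""
    return (n_inputs - 1).bit_length()        # ceil(log2 n_inputs)


def accumulator_bits(count: int, word_bits: int) -> int:
    """Accumulator width that holds the sum of `count` words of
    `word_bits` bits each without overflow."""
    return word_bits + (count - 1).bit_length()


print(control_lines(16))          # barrel shifter with 16 inputs
print(accumulator_bits(64, 16))   # sum of 64 16-bit numbers

# Worst-case check: 64 copies of the largest unsigned 16-bit value.
worst = 64 * 0xFFFF
print(worst.bit_length())
```

The worst-case sum needs exactly the number of bits the formula predicts, so no narrower accumulator would be safe.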
Multiply and Accumulate Unit
Overflow and Underflow
While designing a MAC unit, attention has to be paid to the word sizes at the input of the multiplier and to the sizes of the add/subtract unit and the accumulator, as overflow and underflow are possible. Overflow/underflow can be avoided by using any of the following methods:
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
Saturation logic
Overflow/underflow occurs if the result goes beyond the most positive number or below the most negative number the accumulator can handle. The error can be contained by loading the accumulator with the most positive number it can handle on overflow and with the most negative number it can handle on underflow. This method is called saturation logic.
Arithmetic and Logic Unit
The arithmetic logic unit (ALU) carries out the additional arithmetic and logic operations required for a DSP:
add, subtract, increment, decrement, negate
AND, OR, NOT, XOR, compare
shift, multiply (uncommon to general microprocessors)
with additional features common to general microprocessors:
status flags for sign, zero, carry and overflow
overflow management via saturation logic
register files for storing intermediate results
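Saturation logic can be sketched as a clamped addition. The example below assumes a signed two's complement accumulator of a given width; `saturating_add` is a hypothetical name used only for illustration.

```python
def saturating_add(acc: int, value: int, bits: int = 16) -> int:
    """Add `value` to `acc`, clamping the result to the range of a
    signed accumulator that is `bits` wide."""
    max_pos = (1 << (bits - 1)) - 1    # most positive representable value
    min_neg = -(1 << (bits - 1))       # most negative representable value
    result = acc + value
    if result > max_pos:
        return max_pos                 # overflow: load most positive value
    if result < min_neg:
        return min_neg                 # underflow: load most negative value
    return result


print(saturating_add(30000, 10000))    # clamps at the positive limit
print(saturating_add(-30000, -10000))  # clamps at the negative limit
```

Without the clamp, a 16-bit two's complement accumulator would wrap around and report a large value of the opposite sign, which is usually far more damaging to a signal than a clipped peak.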
Figure: Arithmetic Logic Unit of a DSP
Bus Architecture and Memory
Bus architecture and memory play a significant role in dictating the cost, speed and size of DSPs. Common architectures include the von Neumann and Harvard architectures.
Figure: Von Neumann Architecture and Harvard Architecture
Von Neumann Architecture
program and data reside in the same memory
a single bus is used to access both
Implications: slows down program execution, since the processor has to wait for data even after the instruction is made available
Harvard Architecture
program and data reside in separate memories with two independent buses
Implications: faster program execution because of simultaneous memory access capability
On-Chip Memory
on-chip = on-processor
helps in running DSP algorithms faster than when memory is off-chip
dedicated address and data buses are available
speed: on-chip memories should match the speed of the ALU operations
size: the more chip area the memory takes, the less area is available for other DSP functions
Data Addressing Capabilities
an efficient way of accessing data (signal samples and filter coefficients) can significantly improve implementation performance
flexible ways of accessing data help in writing efficient programs
data addressing modes enhance DSP implementations
DSP Addressing Modes
Immediate
Register
Direct
Indirect
Special Addressing Modes:
Circular
Bit-reversed
Immediate Addressing Mode
operand is explicitly known in value
capability to include data as part of the instruction
Instruction    Operation
ADD #imm       #imm + A → A
#imm: value represented by imm (a fixed number, such as a filter coefficient, that is known ahead of time)
A: accumulator register
Register Addressing Mode
operand is always in processor register reg
capability to reference data through its register
Instruction    Operation
ADD reg        reg + A → A
reg: processor register that provides the operand
A: accumulator register
Direct Addressing Mode
operand is always in memory location mem
capability to reference data by giving its memory location directly
Instruction    Operation
ADD mem        mem + A → A
mem: specified memory location that provides the operand (e.g., memory could hold an input signal value)
A: accumulator register
Indirect Addressing Mode
operand memory location is variable
operand address is given by the value of register addrreg
operand is accessed using the pointer addrreg
Instruction    Operation
ADD addrreg    addrreg + A → A
addrreg: needs to be loaded with the operand's memory location before use
A: accumulator register
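The four basic addressing modes differ only in where the ADD instruction finds its operand. The toy model below is a hedged Python sketch, not the instruction set of any real DSP; the class `ToyDSP`, its register names and its memory contents are all hypothetical.

```python
class ToyDSP:
    """Hypothetical accumulator machine illustrating four ADD variants."""

    def __init__(self):
        self.a = 0                                  # accumulator A
        self.reg = {"r1": 5, "addrreg": 0x20}       # processor registers
        self.mem = {0x10: 7, 0x20: 9}               # data memory

    def add_immediate(self, imm):
        self.a += imm                               # operand is in the instruction

    def add_register(self, name):
        self.a += self.reg[name]                    # operand is in a register

    def add_direct(self, address):
        self.a += self.mem[address]                 # operand address is in the instruction

    def add_indirect(self, name):
        self.a += self.mem[self.reg[name]]          # operand address is in a register


dsp = ToyDSP()
dsp.add_immediate(3)        # A = 3
dsp.add_register("r1")      # A = 3 + 5
dsp.add_direct(0x10)        # A = 8 + 7
dsp.add_indirect("addrreg") # A = 15 + 9
print(dsp.a)
```

Only the indirect form lets a loop step through a block of samples by updating the pointer register, which is why it is the basis of the special addressing modes.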
Special Addressing Modes
Circular Addressing Mode:
a circular buffer allows one to handle a continuous stream of incoming data samples; once the end of the buffer is reached, samples are wrapped around and added to the beginning again
useful for implementing real-time digital signal processing where the input stream is effectively continuous
Bit-Reversed Addressing Mode:
the address generation unit can be provided with the capability of providing bit-reversed indices
useful for implementing radix-2 FFT (fast Fourier transform) algorithms where either the input or the output is in bit-reversed order
Circular Addressing:
Can avoid constantly testing for the need to wrap.
Suppose we consider eight registers to store an incoming data stream.
Reference Index                          Address
0 = 0 mod 8 = 8 mod 8 = 16 mod 8         0
1 = 1 mod 8 = 9 mod 8 = 17 mod 8         1
2 = 2 mod 8 = 10 mod 8 = 18 mod 8        2
3 = 3 mod 8 = 11 mod 8 = 19 mod 8        3
4 = 4 mod 8 = 12 mod 8 = 20 mod 8        4
5 = 5 mod 8 = 13 mod 8 = 21 mod 8        5
6 = 6 mod 8 = 14 mod 8 = 22 mod 8        6
7 = 7 mod 8 = 15 mod 8 = 23 mod 8        7
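The modulo mapping above can be sketched as a small circular buffer in Python. This is an illustrative software model with a hypothetical class name, `CircularBuffer`; a DSP address generation unit performs the wrap in hardware, so the program never tests for the end of the buffer.

```python
class CircularBuffer:
    """Fixed-size buffer whose write index wraps around modulo its size."""

    def __init__(self, size: int = 8):
        self.size = size
        self.data = [0] * size
        self.index = 0                       # next location to write

    def write(self, sample):
        self.data[self.index] = sample
        # Modulo update replaces an explicit end-of-buffer test.
        self.index = (self.index + 1) % self.size


buf = CircularBuffer(8)
for sample in range(10):                     # ten samples into eight locations
    buf.write(sample)
print(buf.data, buf.index)
```

After ten writes the two oldest samples have been overwritten, and the write index has wrapped around to location 2.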
Bit-Reversed Addressing:
Input Index      Output Index
000 = 0          000 = 0
001 = 1          100 = 4
010 = 2          010 = 2
011 = 3          110 = 6
100 = 4          001 = 1
101 = 5          101 = 5
110 = 6          011 = 3
111 = 7          111 = 7
Speed Issues
fast execution of algorithms is the most important requirement of a DSP architecture
high-speed instruction operation
large throughputs
facilitated by advances in VLSI technology and design innovations
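Returning to bit-reversed addressing, the 3-bit index mapping used for an 8-point radix-2 FFT can be generated with a short routine. The Python function below, with the hypothetical name `bit_reverse`, is a software sketch of what the address generation unit produces directly.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the LSB of index into result
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

Applying the mapping twice returns the original index, so the same routine serves whether the input or the output of the FFT is in bit-reversed order.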
Hardware Architecture
dedicated hardware support for multiplications, scaling, loops and repeats, and special addressing modes is essential for fast DSP implementations
Harvard architecture significantly improves program execution time compared to von Neumann
on-chip memories aid speed of program execution considerably
Parallelism
Parallelism means provision of multiple function units, which may operate in parallel to increase throughput:
multiple memories
different ALUs for data and address computations
advantage: algorithms can perform more than one operation at a time, increasing speed
disadvantage: complex hardware is required to control the units and to make sure instructions and data can be fetched simultaneously
Pipelining
architectural feature in which an instruction is broken into a number of steps
a separate unit performs each step, with all units operating at the same time, usually on different stages of data
advantage: if repeated use of the instruction is required, then after an initial latency the output throughput becomes one instruction per unit time
disadvantages: pipeline latency, and having to break instructions up into equally-timed units
Pipelining example: five steps
Step 1: instruction fetch
Step 2: instruction decode
Step 3: operand fetch
Step 4: execute
Step 5: save
Pipelining for speeding up the execution of an instruction
Time slot   Step 1   Step 2   Step 3   Step 4   Step 5   Result
T0          Inst 1
T1          Inst 2   Inst 1
T2          Inst 3   Inst 2   Inst 1
T3          Inst 4   Inst 3   Inst 2   Inst 1
T4          Inst 5   Inst 4   Inst 3   Inst 2   Inst 1   Inst 1 complete
T5          Inst 6   Inst 5   Inst 4   Inst 3   Inst 2   Inst 2 complete
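The occupancy of a five-step pipeline over time can be generated programmatically. The Python sketch below assumes an ideal pipeline with no stalls and one new instruction issued per time slot; `pipeline_schedule` is a hypothetical helper name.

```python
def pipeline_schedule(num_instructions: int, num_stages: int = 5):
    """Return one row per time slot; each row lists the 1-based instruction
    number occupying each stage, or None if the stage is idle."""
    rows = []
    total_slots = num_instructions + num_stages - 1
    for t in range(total_slots):
        row = []
        for stage in range(num_stages):
            inst = t - stage + 1               # instruction in this stage at slot t
            row.append(inst if 1 <= inst <= num_instructions else None)
        rows.append(row)
    return rows


for t, row in enumerate(pipeline_schedule(6)):
    cells = ["Inst %d" % i if i else "-" for i in row]
    print("T%d" % t, *cells)
```

The first instruction leaves the last stage in slot T4; from then on one instruction completes per slot, which is the throughput benefit claimed for pipelining after the initial latency.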
Consider an 8-tap FIR filter: y(n) = ∑ h(i) x(n-i), i = 0, ..., 7
The filter can be implemented in many ways depending on the multipliers and accumulators available.
1. Implementation using a single MAC unit
Figure: single MAC unit implementation, with inputs x(n), x(n-1), ..., x(n-7) separated by delays of 8T, a multiplexer, coefficients h(0), h(1), ..., h(7), and a multiplier feeding the MAC unit
2. Pipelined implementation of an 8-tap FIR filter using eight MACs
3. Parallel implementation using two MAC units
Type of implementation                    Maximum sample rate   Maximum throughput
1 MAC                                     1/8T                  1 sample in 8T units of time
Pipelined (8 multipliers and 8 adders)    1/T                   1 sample in T units of time
2 MAC                                     1/4T                  1 sample in 4T units of time
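The single-MAC computation and the sample rates in the table can be reproduced with a short Python sketch. It assumes each MAC operation takes T units of time and that the taps are divided evenly among the available MAC units; the function names `fir_single_mac` and `max_sample_rate` are hypothetical.

```python
def fir_single_mac(h, x_history):
    """Compute y(n) = sum of h(i) * x(n - i) with one MAC per tap.
    x_history holds [x(n), x(n-1), ..., x(n-7)]."""
    acc = 0
    for coeff, sample in zip(h, x_history):
        acc += coeff * sample            # one multiply-accumulate, T units of time
    return acc


def max_sample_rate(taps: int, mac_units: int, t: float = 1.0) -> float:
    """Outputs per unit time when `taps` MAC operations per output are
    shared among `mac_units` units that work concurrently."""
    cycles = -(-taps // mac_units)       # ceiling division
    return 1.0 / (cycles * t)


print(fir_single_mac([1] * 8, list(range(8))))
print(max_sample_rate(8, 1), max_sample_rate(8, 2), max_sample_rate(8, 8))
```

With T = 1 the three rates correspond to 1/8T, 1/4T and 1/T, matching the single-MAC, two-MAC and eight-MAC pipelined rows of the table.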