1
UNIT-III ARM APPLICATION DEVELOPMENT
2
OVERVIEW
Introduction to DSP on ARM
FIR filter, IIR filter
Discrete Fourier transform
Exception handling
Interrupts, interrupt handling schemes
Firmware and boot loader
Embedded operating systems
Integrated development environment
STDIO libraries
Peripheral interface
Application of ARM processor
Caches, memory protection units, memory management units
Future ARM technologies
3
ARM VS DSP
Both DSP and ARM processors are types of microprocessor. A microprocessor is a silicon chip that contains the central processing unit (CPU) of a device. ARM processors are based on the RISC design philosophy, and RISC microprocessors are usually intended for general-purpose use. A DSP processor is another type of microprocessor. DSP stands for digital signal processing, meaning any signal processing performed on a digital signal. A DSP processor is a specialized microprocessor whose architecture is optimized for the operational needs of digital signal processing.
4
APPLICATIONS
ARM: ARM processors are well known and widely used in light, portable, battery-powered devices such as smartphones and tablet computers.
DSP: Digital signal processing aims to modify or improve a signal. It works on discrete representations, such as discrete-time or discrete-frequency signals. The main goal of a DSP processor is to measure, filter, and/or compress digital or analog signals.
5
INTRODUCTION TO DSP ON ARM
6
Microprocessors now wield enough computational power to process real-time digitized signals.
Familiar examples include MP3 audio players, digital cameras, and digital mobile/cellular telephones. Processing digitized signals requires high memory bandwidth and fast multiply-accumulate operations. Traditionally, an embedded or portable device would contain two types of processor: a microcontroller to handle the user interface and a separate DSP processor to manipulate digitized signals such as audio.
7
However, now you can often use a single microprocessor to perform both tasks because of the higher performance and clock frequencies available on microprocessors today. A single-core design can reduce cost and power consumption over a two-core solution. The ARMv5TE extensions available in the ARM9E and later cores provide efficient multiply accumulate operations.
8
Increase in performance available on different generations of the ARM core.
10
The ARM core is not a dedicated DSP. There is no single instruction that issues a multiply accumulate and data fetch in parallel. However, by reusing loaded data you can achieve a respectable DSP performance. The key idea is to use block algorithms that calculate several results at once, and thus require less memory bandwidth, increase performance, and decrease power consumption compared with calculating single results.
11
The ARM also differs from a standard DSP when it comes to precision and saturation. In general, ARM does not provide operations that saturate automatically, and saturating versions of operations usually cost additional cycles. Saturation clips a result to a fixed range to prevent overflow:
saturate16(x) = x clipped to the range −0x00008000 to +0x00007fff inclusive
saturate32(x) = x clipped to the range −0x80000000 to +0x7fffffff inclusive
On the other hand, ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit operations very well.
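The clipping operations can be sketched in portable C as follows. This is a minimal illustration, not an ARM-specific implementation: saturate16 and saturate32 follow the definitions in the text, while qadd32 is a hypothetical helper name showing how a saturating add can be built from a widened sum.

```c
#include <stdint.h>

/* Clip a 32-bit value to the signed 16-bit range -0x8000 .. +0x7fff. */
int32_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return x;
}

/* Clip a 64-bit value to the signed 32-bit range -0x80000000 .. +0x7fffffff. */
int32_t saturate32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Hypothetical helper: saturating 32-bit add, computed on a 64-bit sum
 * so the intermediate result cannot overflow. */
int32_t qadd32(int32_t a, int32_t b)
{
    return saturate32((int64_t)a + (int64_t)b);
}
```

The compare-and-clip steps are the extra work that the text says costs additional cycles when the core has no automatic saturation.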
12
Guidelines for Writing DSP Code for ARM
Design the DSP algorithm so that saturation is not required, because saturation will cost extra cycles. Use extended-precision arithmetic or additional scaling rather than saturation.
Design the DSP algorithm to minimize loads and stores. Once you load a data item, perform as many operations that use the datum as possible.
Write ARM assembly to avoid processor interlocks. The results of load and multiply instructions are often not available to the next instruction without adding stall cycles. Sometimes the results will not be available for several cycles.
There are 14 registers available for general use on the ARM, r0 to r12 and r14. Design the DSP algorithm so that the inner loop will require 14 registers or fewer.
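As an illustration of the first guideline, a sum of 16-bit products can be accumulated in 64 bits so that no saturation check is needed inside the loop. This is a sketch under assumptions: the function name dot_q15 and the Q15 scaling are chosen here for illustration and do not come from the slides.

```c
#include <stdint.h>

/* Dot product of two Q15 arrays, accumulated in 64 bits.
 * Each product is Q30 and fits in 32 bits; the 64-bit accumulator
 * cannot overflow for any realistic n, so no clipping is required
 * inside the loop. The caller must ensure the scaled result fits
 * in 32 bits. */
int32_t dot_q15(const int16_t *x, const int16_t *c, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)c[i];
    }
    return (int32_t)(acc >> 15);   /* scale the Q30 sum back to Q15 */
}
```

With four products of 32767 × 32767 the raw sum already exceeds the signed 32-bit range, which is the case the wider accumulator is there to absorb.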
13
DSP on the ARM7TDMI
The ARM7TDMI has a 32-bit by 8-bit per cycle multiply array with early termination. It takes four cycles for a 16-bit by 16-bit to 32-bit multiply accumulate. Load instructions take three cycles and store instructions two cycles for zero-wait-state memory or cache.
Guidelines for Writing DSP Code for the ARM7TDMI
Load instructions are slow, taking three cycles to load a single value. To access memory efficiently, use the load and store multiple instructions LDM and STM. Load and store multiples require only a single cycle for each additional word transferred after the first word. This often means it is more efficient to store 16-bit data values in 32-bit words.
14
The multiply instructions use early termination based on the second operand in the product Rs. For predictable performance use the second operand to specify constant coefficients or multiples. Multiply is one cycle faster than multiply accumulate. It is sometimes useful to split an MLA instruction into separate MUL and ADD instructions. You can then use a barrel shift with the ADD to perform a scaled accumulate.
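The scaled accumulate can be sketched in C as below. The function name scaled_mac and the shift parameter are hypothetical, and the instruction names in the comments only indicate the kind of MUL and shifted ADD pair the text describes, not guaranteed compiler output.

```c
#include <stdint.h>

/* Multiply, then add the product into the accumulator with a right
 * shift, mirroring a MUL followed by an ADD that uses the barrel
 * shifter. shift must be in the range 0..31. Right-shifting a
 * negative value is implementation-defined in C; mainstream
 * compilers perform an arithmetic shift. */
int32_t scaled_mac(int32_t acc, int16_t x, int16_t c, unsigned shift)
{
    int32_t product = (int32_t)x * (int32_t)c;   /* MUL tmp, x, c */
    return acc + (product >> shift);             /* ADD acc, acc, tmp, ASR #shift */
}
```

Passing the coefficient as the second multiply operand matches the advice above about keeping the early-termination timing predictable.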
15
FIR filters
16
The finite impulse response (FIR) filter is a basic building block of many DSP applications.
You can use a FIR filter to remove unwanted frequency ranges, boost certain frequencies, or implement special effects. We will concentrate on efficient implementation of the filter on the ARM. The FIR filter is the simplest type of digital filter. The filtered sample y_t depends linearly on a fixed, finite number of unfiltered samples x_t. Let M be the length of the filter. Then for some filter coefficients c_i:
$$y_t = \sum_{i=0}^{M-1} c_i \, x_{t-i}$$
17
A direct form discrete-time FIR filter of order N.
18
Example: FIR filter

C:
    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];

Assembler:
    ; loop initiation code
            MOV r0,#0        ; use r0 for i
            MOV r8,#0        ; use separate index for arrays
            ADR r2,N         ; get address for N
            LDR r1,[r2]      ; get value of N
            MOV r2,#0        ; use r2 for f
19
FIR filter, cont'd

            ADR r3,c         ; load r3 with base of c
            ADR r5,x         ; load r5 with base of x
    ; loop body
    loop    LDR r4,[r3,r8]   ; get c[i]
            LDR r6,[r5,r8]   ; get x[i]
            MUL r4,r4,r6     ; compute c[i]*x[i]
            ADD r2,r2,r4     ; add into running sum
            ADD r8,r8,#4     ; add one word offset to array index
            ADD r0,r0,#1     ; add 1 to i
            CMP r0,r1        ; exit?
            BLT loop         ; if i < N, continue
20
Let's look at the issue of dynamic range and possible overflow of the output signal. Suppose that we are using Qn and Qm fixed-point representations X[t] and C[i] for x_t and c_i, respectively. In other words:
$$X[t] = \mathrm{round}(2^n x_t), \qquad C[i] = \mathrm{round}(2^m c_i)$$
Then the accumulated sum A[t] = C[0]*X[t] + C[1]*X[t-1] + ... + C[M-1]*X[t-M+1] is a Q(n+m) representation of y_t. But how large is A[t]? How many bits of precision do we need to ensure that A[t] does not overflow its integer container?
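One conservative way to answer the question is the worst-case bound |A[t]| ≤ max|X| × (|C[0]| + ... + |C[M−1]|). The sketch below computes that bound and the signed width it needs. The function names are hypothetical, and this is one standard bound rather than necessarily the method the slides go on to use.

```c
#include <stdint.h>

/* Worst-case magnitude of A[t], given |X[t]| <= x_max for all t:
 * x_max times the sum of the absolute coefficient values. */
uint64_t fir_worst_case(const int16_t *c, int m, uint32_t x_max)
{
    uint64_t sum_abs = 0;
    for (int i = 0; i < m; i++) {
        int32_t v = c[i];
        sum_abs += (uint64_t)(v < 0 ? -v : v);
    }
    return sum_abs * x_max;
}

/* Smallest signed integer width, in bits, that can hold every
 * value in the range -bound .. +bound. */
int signed_bits_needed(uint64_t bound)
{
    int bits = 1;                 /* sign bit */
    while (bound != 0) {
        bits++;
        bound >>= 1;
    }
    return bits;
}
```

If the reported width is at most 32, a 32-bit accumulator is safe for any input within the stated range; otherwise the coefficients need extra scaling or the accumulator needs more precision.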
22
Block FIR filters
We can usually implement filters using integer sums of products, without the need to check for saturation or overflow:
A[t] = C[0]*X[t] + C[1]*X[t-1] + ... + C[M-1]*X[t-M+1];
Generally X[t] and C[i] are k-bit integers and A[t] is a 2k-bit integer, where k = 8, 16, or 32.
23
By a long filter, we mean that M is so large that you can't hold the filter coefficients in registers. You should optimize short filters, such as the previous example, on a case-by-case basis; for these you can hold many coefficients in registers. An R-way block filter implementation calculates the R values A[t], A[t+1], ..., A[t+R−1] using a single pass over the data X[t] and coefficients C[i]. This reduces the number of memory accesses by a factor of R over calculating each result separately, so R should be as large as the available registers allow.
24
An R × S block filter is an R-way block filter where we read S data and coefficient values at a time for each iteration of the inner loop. On each loop we accumulate R × S products onto the R accumulators. Typical 4 × 3 block filter implementation.
25
Each accumulator on the left is the sum of products of the coefficients on the right multiplied by the signal value heading each column. The diagram starts with the oldest sample X[t−M+1], since the filter routine will load samples in increasing order of memory address. Each inner loop iteration of a 4 × 3 filter accumulates the 12 products in a 4 × 3 parallelogram.
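The block idea can be sketched in C with R = 4 and, for brevity, S = 1 (one coefficient and one new sample per inner iteration) rather than the 4 × 3 shape in the diagram. The function name fir_block4 and the memory layout, with the oldest sample first, are assumptions made for this illustration.

```c
#include <stdint.h>

/* 4-way block FIR: computes four outputs per pass over the coefficients.
 *
 * x holds n + m - 1 samples, oldest first, so that
 *   a[t] = c[0]*x[t+m-1] + c[1]*x[t+m-2] + ... + c[m-1]*x[t].
 * m must be at least 1 and n must be a multiple of 4 (the block size).
 */
void fir_block4(int32_t *a, const int16_t *x, const int16_t *c, int m, int n)
{
    for (int t = 0; t < n; t += 4) {
        int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;      /* four accumulators */
        const int16_t *xp = x + t;
        int32_t w0 = xp[0], w1 = xp[1], w2 = xp[2];  /* sliding sample window */

        for (int j = 0; j < m; j++) {
            int32_t w3 = xp[j + 3];      /* the only sample load this iteration */
            int32_t ck = c[m - 1 - j];   /* the only coefficient load */
            a0 += ck * w0;
            a1 += ck * w1;
            a2 += ck * w2;
            a3 += ck * w3;
            w0 = w1;                     /* slide the window by one sample */
            w1 = w2;
            w2 = w3;
        }
        a[t] = a0;
        a[t + 1] = a1;
        a[t + 2] = a2;
        a[t + 3] = a3;
    }
}
```

Each inner iteration performs four multiply accumulates for two loads, compared with two loads per multiply accumulate when every output is computed separately.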
26
Writing FIR Filters on the ARM
If the number of FIR coefficients is small enough, then hold the coefficients and history samples in registers. Often coefficients are repeated, which reduces the number of registers you need.
If the FIR filter length is long, then use a block filter algorithm of size R × (R − 1) or R × R. Choose the largest R possible given the 14 available general-purpose registers on the ARM.
Ensure that the input arrays are aligned to the access size. This will be 64-bit when using LDRD.
Ensure that the array length is a multiple of the block size.
Schedule to avoid all load-use and multiply-use interlocks.
27
IIR Filters
28
An infinite impulse response (IIR) filter is a digital filter that depends linearly on a finite number of input samples and a finite number of previous filter outputs. In other words, it combines a FIR filter with feedback from previous filter outputs. Mathematically, for some coefficients b_i and a_j (writing the feedback terms with a minus sign, a common convention):
$$y_t = \sum_{i=0}^{M} b_i \, x_{t-i} - \sum_{j=1}^{L} a_j \, y_{t-j}$$
If you feed in the impulse signal x = (1, 0, 0, 0, ...), then y_t may oscillate forever. This is why it has an infinite impulse response. However, for a stable filter, y_t will decay to zero. We will concentrate on efficient implementation of this filter.
29
IIR filter example
30
You can calculate the output signal y_t directly, using the general equation. In this case the code is similar to the FIR. However, this calculation method may be numerically unstable. It is often more accurate, and more efficient, to factorize the filter into a series of biquads, that is, IIR filters with M = L = 2:
$$y_t = b_0 x_t + b_1 x_{t-1} + b_2 x_{t-2} - a_1 y_{t-1} - a_2 y_{t-2}$$
We can implement any IIR filter by repeatedly filtering the data by a number of biquads. To see this, we use the z-transform. This transform associates with each signal x_t a polynomial x(z) defined as
$$x(z) = \sum_t x_t \, z^{-t}$$
Under this transform the filter equation becomes y(z) = H(z) x(z), where H(z) is a ratio of two polynomials in z^{-1}. Factorizing both polynomials into quadratics with real coefficients gives the biquads.
"Biquad" is an abbreviation of "biquadratic", which refers to the fact that in the z domain its transfer function is the ratio of two quadratic functions.
31
So, now we only have to implement biquads efficiently. On the face of it, to calculate y_t for a biquad, we need the current sample x_t and four history elements x_{t−1}, x_{t−2}, y_{t−1}, y_{t−2}. However, there is a trick to reduce the number of history or state values we require from four to two.
32
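We define an intermediate signal s_t by
$$s_t = x_t - a_1 s_{t-1} - a_2 s_{t-2}$$
Then the output is
$$y_t = b_0 s_t + b_1 s_{t-1} + b_2 s_{t-2}$$
so only the two state values s_{t−1} and s_{t−2} need to be kept.
The coefficient b_0 controls the amplitude of the biquad. We can assume that b_0 = 1 when performing a series of biquads, and use a single multiply or shift at the end to correct the signal amplitude. So, to summarize, we have reduced an IIR to filtering by a series of biquads of the form
$$s_t = x_t - a_1 s_{t-1} - a_2 s_{t-2}, \qquad y_t = s_t + b_1 s_{t-1} + b_2 s_{t-2}$$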
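The two-state form can be sketched in C as follows. Floating point is used only to keep the illustration short. The struct and function names are hypothetical, and the sign convention (feedback terms subtracted) is an assumption of this sketch.

```c
/* One biquad section using two state values (s[t-1] and s[t-2]). */
typedef struct {
    double b0, b1, b2;   /* feed-forward coefficients */
    double a1, a2;       /* feedback coefficients, subtracted */
    double s1, s2;       /* state: s[t-1], s[t-2] */
} biquad;

/* Process one input sample and return one output sample. */
double biquad_step(biquad *q, double x)
{
    /* Intermediate signal: s[t] = x[t] - a1*s[t-1] - a2*s[t-2] */
    double s0 = x - q->a1 * q->s1 - q->a2 * q->s2;

    /* Output: y[t] = b0*s[t] + b1*s[t-1] + b2*s[t-2] */
    double y = q->b0 * s0 + q->b1 * q->s1 + q->b2 * q->s2;

    /* Age the state by one sample. */
    q->s2 = q->s1;
    q->s1 = s0;
    return y;
}
```

With b0 = 1, a1 = −0.5 and all other coefficients zero, feeding in the impulse 1, 0, 0 gives the decaying response 1, 0.5, 0.25, the behavior expected of a stable filter.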
33
For a block IIR, we split the input signal x_t into large frames of N samples. We make multiple passes over the signal, filtering by as many biquads as we can hold in registers on each pass. Typically for ARMv4 processors we filter by one biquad on each pass; for ARMv5TE processors, by two biquads.
Implementing 16-bit IIR Filters
Factorize the IIR into a series of biquads. Choose the data precision so there can be no overflow during the IIR calculation.
Use a block IIR algorithm, dividing the signal to be filtered into large frames.
On each pass of the sample frame, filter by M biquads. Choose M to be the largest number of biquads so that you can hold the state and coefficients in the 14 available registers on the ARM. Ensure that the total number of biquads is a multiple of M.
As always, schedule code to avoid load-use and multiply-use interlocks.
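A fixed-point sketch of the block scheme, filtering a 16-bit frame in place by one biquad per pass, might look like this. The Q14 coefficient format, the type biquad_q14, and the function iir_block_q14 are assumptions for illustration. The code assumes b0 = 1, feedback terms subtracted, and data scaled so that no overflow occurs.

```c
#include <stdint.h>

/* One biquad with b0 = 1. Coefficients are Q14 (value times 2^14). */
typedef struct {
    int16_t a1, a2, b1, b2;   /* Q14 coefficients */
    int32_t s1, s2;           /* state: s[t-1], s[t-2] */
} biquad_q14;

/* Filter a frame of n samples in place, one pass per biquad.
 * Right-shifting a negative value is implementation-defined in C;
 * mainstream compilers perform an arithmetic shift. */
void iir_block_q14(int16_t *frame, int n, biquad_q14 *stages, int num_stages)
{
    for (int k = 0; k < num_stages; k++) {
        biquad_q14 *q = &stages[k];
        int32_t s1 = q->s1, s2 = q->s2;   /* keep state in locals */

        for (int t = 0; t < n; t++) {
            int64_t fb = (int64_t)q->a1 * s1 + (int64_t)q->a2 * s2;
            int64_t ff = (int64_t)q->b1 * s1 + (int64_t)q->b2 * s2;
            int32_t s0 = frame[t] - (int32_t)(fb >> 14);
            int32_t y  = s0 + (int32_t)(ff >> 14);
            frame[t] = (int16_t)y;        /* assumes no overflow by design */
            s2 = s1;
            s1 = s0;
        }
        q->s1 = s1;                       /* carry state into the next frame */
        q->s2 = s2;
    }
}
```

Holding the coefficients and state of each biquad in local variables for the whole pass mirrors the register budget the guidelines describe.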
34
The Discrete Fourier Transform
35
The Discrete Fourier Transform (DFT) converts a time domain signal x_t to a frequency domain signal y_k. The associated inverse transform (IDFT) reconstructs the time domain signal from the frequency domain signal. This tool is heavily used in signal analysis and compression. It is particularly powerful because there is an algorithm, the Fast Fourier Transform (FFT), that implements the DFT very efficiently. We will look at some efficient ARM implementations of the FFT. The DFT acts on a frame of N complex time samples, converting them into N complex frequency coefficients.
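For reference, the transform pair can be written out explicitly. The sign of the exponent and the placement of the 1/N factor below follow one common convention and are an assumption here, since the slides do not show the formula.

```latex
% Assumed convention: w_N is the N-th root of unity used as the DFT kernel.
\begin{align*}
w_N &= e^{-2\pi i / N} \\
y_k &= \sum_{t=0}^{N-1} x_t \, w_N^{kt}, & k &= 0, 1, \ldots, N-1 \\
x_t &= \frac{1}{N} \sum_{k=0}^{N-1} y_k \, w_N^{-kt}, & t &= 0, 1, \ldots, N-1
\end{align*}
```

Evaluated directly, each of the N outputs needs N complex multiply accumulates, about N^2 in total, which is the cost the FFT reduces.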
36
The Fast Fourier Transform
The idea of the FFT is to break down the transform by factorizing N. Suppose for example that N = R × S. Split the output into S blocks of size R and the input into R blocks of size S. In other words:
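One way to write this split is sketched below. The index names a, b, c, d are introduced here for illustration only, and the kernel w_N = e^{-2πi/N} is an assumed convention, so this may differ in notation from the original slide.

```latex
% Output index k: S blocks of size R. Input index t: R blocks of size S.
\begin{align*}
k &= Rb + a, & 0 \le a < R,\ 0 \le b < S \\
t &= Sc + d, & 0 \le c < R,\ 0 \le d < S \\
w_N^{kt} &= w_N^{RSbc} \, w_N^{Rbd} \, w_N^{Sac} \, w_N^{ad}
          = w_S^{bd} \, w_R^{ac} \, w_N^{ad} \\
y_{Rb+a} &= \sum_{d=0}^{S-1} w_S^{bd}
            \left[ w_N^{ad} \sum_{c=0}^{R-1} w_R^{ac} \, x_{Sc+d} \right]
\end{align*}
```

The third line uses w_N^{RS} = 1, w_N^{R} = w_S and w_N^{S} = w_R. The inner sum is a DFT of size R, the factor w_N^{ad} is a twiddle multiplication, and the outer sum is a DFT of size S, so one transform of size N becomes S transforms of size R followed by R transforms of size S.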