Implementing a FIR-filter algorithm using MMX instructions
by Lars Persson
Merging the history buffer and the input buffer

When computing the first num_taps-1 samples, we need to access both the input and the history buffer. Depending on the implementation, this might require extra branch instructions in the inner or outer loop.

Improved history buffer:

    [ X-3 X-2 X-1 | X0 X1 X2 … | … ]
      num_taps-1 samples from last call | num_taps-1 new samples | zero-padded
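The merged layout can be sketched in plain C. This is only an illustration of the idea, not the MMX code from the slides: the function name fir_merged, the work buffer and its size (num_taps-1 + n shorts) are assumptions, and the zero-padded tail is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: work[0 .. hist-1] holds the last hist = num_taps-1
   samples of the previous call; the n new samples are appended after them.
   work must have room for hist + n shorts. */
void fir_merged(short *work, const short *in, size_t n,
                const short *taps, size_t num_taps, int *out)
{
    assert(num_taps >= 1);
    size_t hist = num_taps - 1;

    /* Append the new samples right after the saved history. */
    memcpy(work + hist, in, n * sizeof(short));

    /* Every output reads one contiguous window, so the loops need no
       branch to choose between history and input. */
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t k = 0; k < num_taps; k++)
            acc += (int)taps[k] * work[hist + i - k];
        out[i] = acc;
    }

    /* Keep the last hist samples as history for the next call. */
    memmove(work, work + n, hist * sizeof(short));
}
```

On the first call the history part of work should be zeroed. An impulse placed at the end of one block then shows up correctly at the start of the next block's output.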
Preparing the taps array

The filter tap array is prepared according to the Intel example. That is, it is reversed and 3 shifted copies are made. Also, the number of taps is rounded to a multiple of 4.

[Figure: the original taps (t1 t2 t3 0) and the reversed taps (t3 t2 t1) with their three shifted, zero-padded copies.]
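A minimal C sketch of this preparation step follows. The helper names (round_up4, tap_row_len, prepare_taps) and the row length, chosen here as num_taps + 3 rounded up to a multiple of 4 so that the largest shift still fits, are assumptions; the exact padding used in the Intel example may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of 4. */
size_t round_up4(size_t n) { return (n + 3) & ~(size_t)3; }

/* Assumed row length: reversed taps plus room for a shift of up to 3. */
size_t tap_row_len(size_t num_taps) { return round_up4(num_taps + 3); }

/* rows must hold 4 * tap_row_len(num_taps) shorts.
   Row s contains the reversed taps shifted right by s positions,
   with zeros everywhere else. */
void prepare_taps(const short *taps, size_t num_taps, short *rows)
{
    assert(num_taps >= 1);
    size_t len = tap_row_len(num_taps);

    memset(rows, 0, 4 * len * sizeof(short));
    for (size_t s = 0; s < 4; s++)
        for (size_t k = 0; k < num_taps; k++)
            rows[s * len + s + k] = taps[num_taps - 1 - k];
}
```

For taps {t1, t2, t3} this gives four rows of eight shorts: t3 t2 t1 0 0 0 0 0, then 0 t3 t2 t1 0 0 0 0, and so on.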
The convolution sum

    LOOP:
        // Load 4 samples
        movq    mm0, [esi]
        movq    mm1, mm0
        // preload taps that are shifted 2 and 3 steps
        lea     edi, [ebx+2*ecx]
        movq    mm4, [edi]
        movq    mm7, [edi+ecx]
        // multiply with taps
        pmaddwd mm0, [ebx]
        paddd   mm6, mm0
        // multiply with taps shifted one step
        movq    mm0, mm1
        pmaddwd mm0, [ebx+ecx]
        paddd   mm5, mm0
        // multiply with taps shifted 2 steps
        pmaddwd mm4, mm1
        paddd   mm3, mm4
        // multiply with taps shifted 3 steps
        pmaddwd mm7, mm1
        paddd   mm2, mm7
        // update pointers for next loop iteration
        add     esi, 8
        add     ebx, 8
        sub     eax, 1
        jnz     LOOP
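The data flow of this loop can be modelled in scalar C. The sketch below is an assumption-laden model rather than a translation: the type lanes2 stands in for an MMX register holding two 32-bit sums, pmaddwd4 mimics pmaddwd on four 16-bit words, and rows is assumed to hold four shifted tap rows of len shorts each (corresponding to [ebx], [ebx+ecx], [ebx+2*ecx] and [ebx+3*ecx]).

```c
#include <assert.h>
#include <stddef.h>

/* Two 32-bit lanes, standing in for one MMX register after pmaddwd. */
typedef struct { int lo, hi; } lanes2;

/* Model of pmaddwd on four 16-bit words: multiply pairwise, add neighbours. */
lanes2 pmaddwd4(const short *a, const short *b)
{
    lanes2 r = { a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3] };
    return r;
}

/* x:    window of len samples (len is a multiple of 4)
   rows: four shifted tap rows, len shorts each
   out:  four consecutive filter outputs */
void fir4(const short *x, const short *rows, size_t len, int out[4])
{
    /* acc[0..3] play the roles of mm6, mm5, mm3 and mm2. */
    lanes2 acc[4] = { {0, 0}, {0, 0}, {0, 0}, {0, 0} };
    assert(len % 4 == 0);

    for (size_t j = 0; j < len; j += 4) {          /* one pass = one loop iteration */
        for (size_t s = 0; s < 4; s++) {
            lanes2 p = pmaddwd4(x + j, rows + s * len + j);
            acc[s].lo += p.lo;                      /* paddd */
            acc[s].hi += p.hi;
        }
    }

    /* Final horizontal add of the two lanes of each accumulator. */
    for (size_t s = 0; s < 4; s++)
        out[s] = acc[s].lo + acc[s].hi;
}
```

Each group of four samples is loaded once and multiplied against all four shifted tap rows, which is what lets one pass over the samples produce four outputs.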
Parallel summation

        // low samples: mm6, mm5
        movq      mm4, mm6
        punpckhdq mm4, mm5
        punpckldq mm6, mm5
        paddd     mm6, mm4
        // [ out(n+1) out(n) ] in mm6

        // high samples: mm3, mm2
        movq      mm4, mm3
        punpckhdq mm4, mm2
        punpckldq mm3, mm2
        paddd     mm3, mm4
        // [ out(n+3) out(n+2) ] in mm3
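To see why four instructions are enough to finish two outputs, the unpack-and-add sequence can be modelled in C. The type mmreg and the function names below are illustrative stand-ins for a 64-bit MMX register and the corresponding instructions, not real intrinsics.

```c
#include <assert.h>

/* One MMX register viewed as two 32-bit lanes. */
typedef struct { int lo, hi; } mmreg;

/* punpckhdq dst, src: low lane = dst.hi, high lane = src.hi */
mmreg model_punpckhdq(mmreg dst, mmreg src)
{
    mmreg r = { dst.hi, src.hi };
    return r;
}

/* punpckldq dst, src: low lane = dst.lo, high lane = src.lo */
mmreg model_punpckldq(mmreg dst, mmreg src)
{
    mmreg r = { dst.lo, src.lo };
    return r;
}

/* paddd: lane-wise 32-bit add */
mmreg model_paddd(mmreg a, mmreg b)
{
    mmreg r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Sum the two lanes of a and of b at the same time.
   Result: lo = a.lo + a.hi (out(n)), hi = b.lo + b.hi (out(n+1)). */
mmreg sum_pair(mmreg a, mmreg b)
{
    mmreg t = model_punpckhdq(a, b);   /* movq mm4, mm6 ; punpckhdq mm4, mm5 */
    a = model_punpckldq(a, b);         /* punpckldq mm6, mm5 */
    return model_paddd(a, t);          /* paddd mm6, mm4 */
}
```

With a playing the role of mm6 and b of mm5, the result matches the comment in the listing: out(n) in the low half and out(n+1) in the high half.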
Loop optimization

The inner loop keeps as much data as possible in registers. Only taps and samples are loaded from memory.
The parallel summation is done with 8 instructions, compared to 12 instructions in my SSE version.
Memory copying is done with the rep instruction prefix. This avoids a branch instruction.
So far: about 36 million cycles, including the float-to-short conversion.
Optimizing float to short conversion

The C language standard requires that float-to-integer conversion is done with truncation, i.e. 3.6 is converted to 3, as opposed to 4 when rounding. On the x86 architecture this requires changing the FPU control word, which is a very expensive operation. The solution is to use the fistp instruction directly: it converts using the current rounding mode (round to nearest by default), so the control word does not have to be changed.
The compiler calls this function once for every conversion:

    __ftol:
    00402B24 push ebp
    00402B25 mov ebp,esp
    00402B27 add esp,0FFFFFFF4h
    00402B2A wait
    00402B2B fnstcw word ptr [ebp-2]
    00402B2E wait
    00402B2F mov ax,word ptr [ebp-2]
    00402B33 or ah,0Ch
    00402B36 mov word ptr [ebp-4],ax
    00402B3A fldcw word ptr [ebp-4]
    00402B3D fistp qword ptr [ebp-0Ch]
    00402B40 fldcw word ptr [ebp-2]
    00402B43 mov eax,dword ptr [ebp-0Ch]
    00402B46 mov edx,dword ptr [ebp-8]
    00402B49 leave
    00402B4A ret

Optimized conversion routine:

        mov     ecx, num_samples
        mov     esi, inputs
        mov     edi, input
        mov     esi, [esi]
        sub     ecx, 1
    LOOP1:
        fld     dword ptr [esi+ecx*4]
        fistp   word ptr [edi+ecx*2]
        sub     ecx, 1
        jge     LOOP1