Implementation of the Convolution Operation on General Purpose Processors Ernest Jamro AGH Technical University Kraków, Poland
E. Jamro, K. Wiatr; Implementation of Convolution Operation on General Purpose Processors
Contents
1. Introduction.
2. Pentium and Itanium processors: pipelining, superscalar, Very Long Instruction Word (VLIW), Single Instruction Multiple Data (SIMD).
3. Convolvers implemented in FPGAs.
4. Conclusions.
Mathematical operation
Symbols:
– $N$, $M$: the size of the convolution kernel (usually odd numbers),
– $a_{y,x}$: an input,
– $b_{y,x}$: an output,
– $w_{i,j}$: a coefficient of the convolution,
– $D$: a common denominator, $D = 2^n$.
For a 512 × 512 image at 25 frames/s and a 3 × 3 convolution kernel:
$L_M = N_X N_Y N_F N M = 512 \cdot 512 \cdot 25 \cdot 9 \approx 59 \times 10^6$ multiplies/s
$L_A = N_X N_Y N_F (N M - 1) = 512 \cdot 512 \cdot 25 \cdot 8 \approx 52 \times 10^6$ additions/s
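The formula itself did not survive on this slide. A plausible reconstruction from the symbol definitions is given here as a sketch. The index origin is an assumption: the window is taken to start at its top-left pixel, as the pointer arithmetic in the C-language program suggests, whereas a centred form would subtract $(N-1)/2$ and $(M-1)/2$ from the input indices.

```latex
b_{y,x} = \frac{1}{D} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j}\, a_{y+i,\,x+j}
```

Each output pixel therefore costs $N M$ multiplications and $N M - 1$ additions, which is where the per-second operation counts come from.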
C-language program
For every output pixel:

    int sum = D / 2;                   // accumulation result, initially D/2 to minimise the division rounding error
    for (int i = 0; i < N; i++)        // vertical convolution
    {
        for (int j = 0; j < M; j++)    // horizontal convolution
            sum += *pw++ * *pa++;      // the kernel of the convolution
        pa += Nx - M;                  // pa now points to the first pixel in the next line
    }
    sum /= D;                          // division by the common denominator
    *pb = (BYTE) sum;                  // conversion from int (4 bytes) to a 1-byte variable; save the result
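The per-pixel fragment can be wrapped into a complete routine. The following is a minimal sketch, not the authors' code: the function name `convolve_u8`, its parameter list, the restriction to the "valid" output region (no border handling) and the clamping to 0..255 are all assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convolve an nx-by-ny 8-bit image with an n-row by m-column integer kernel.
 * Only windows that fit entirely inside the image are computed, so the output
 * holds (ny - n + 1) rows of (nx - m + 1) pixels.
 * d is the common denominator (assumed positive). Results are clamped to
 * 0..255, which the plain (BYTE) cast on the slide does not do. */
void convolve_u8(const uint8_t *a, int nx, int ny,
                 const int *w, int n, int m, int d,
                 uint8_t *b)
{
    assert(n > 0 && m > 0 && d > 0 && nx >= m && ny >= n);

    for (int y = 0; y + n <= ny; y++) {
        for (int x = 0; x + m <= nx; x++) {
            const int *pw = w;
            int sum = d / 2;                      /* rounding offset */

            for (int i = 0; i < n; i++) {         /* vertical convolution */
                /* first pixel of window row i */
                const uint8_t *pa = a + (size_t)(y + i) * (size_t)nx + (size_t)x;
                for (int j = 0; j < m; j++)       /* horizontal convolution */
                    sum += *pw++ * *pa++;
            }

            sum /= d;                             /* common denominator */
            if (sum < 0)   sum = 0;               /* clamp before narrowing */
            if (sum > 255) sum = 255;
            *b++ = (uint8_t)sum;
        }
    }
}
```

The window row pointer is recomputed for every kernel row instead of being advanced by `Nx - M`, so the pointer never moves past the end of the image buffer after the last window.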
Implementation on different Pentium processors
Pipelining in Pentium 75 MHz
[Figure: pipeline diagram; recoverable labels: Branch, Instruction M+3, Instruction M+4]
Loop unrolling

    sum = D / 2;              // initialisation
    sum += *pa++ * *pw++;
    sum += *pa * *pw++;
    pa += N;                  // go to the next line
    sum += *pa++ * *pw++;
    ...
    sum /= D;
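For a fixed 3 × 3 kernel the unrolling can be written out in full. This is a sketch under stated assumptions rather than the code used for the measurements: the name `conv3x3_unrolled`, the row pointers `r0`..`r2` and the unclamped `int` return value are hypothetical choices made for illustration.

```c
#include <stdint.h>

/* One output pixel of a 3x3 convolution with both loops fully unrolled.
 * pa : top-left pixel of the 3x3 window
 * nx : image width in pixels (row stride)
 * pw : nine coefficients in row-major order
 * d  : common denominator (assumed positive)
 * Returns the rounded, unclamped result. */
int conv3x3_unrolled(const uint8_t *pa, int nx, const int *pw, int d)
{
    const uint8_t *r0 = pa;            /* first window row  */
    const uint8_t *r1 = pa + nx;       /* second window row */
    const uint8_t *r2 = pa + 2 * nx;   /* third window row  */

    int sum = d / 2;                   /* rounding offset */

    sum += pw[0] * r0[0] + pw[1] * r0[1] + pw[2] * r0[2];
    sum += pw[3] * r1[0] + pw[4] * r1[1] + pw[5] * r1[2];
    sum += pw[6] * r2[0] + pw[7] * r2[1] + pw[8] * r2[2];

    return sum / d;
}
```

With no loop counters or conditional branches left, the branch penalties discussed for the pipelined Pentium no longer apply to the per-pixel kernel; the cost is larger code that is tied to one kernel size.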
Loop unrolling
[Figure: relative calculation time after loop unrolling (LU), $t_{LU}/t_{stand}$]
DSP solution, e.g. Texas Instruments TMS320C80
Hardware loop control
Special registers:
– Loop Counter: number of branches to the start of the loop
– Loop End: points to the last instruction in the loop
– Loop Start: points to the first instruction in the loop
– Loop Reload
Superscalar architecture
Instruction Level Parallelism

Non-optimised:

    // pixel 3 (top-right pixel in the convolution window)
    xor  edx, edx                    // clear edx
    mov  dl, byte ptr [ecx+2]        // load data a (pixel 3)
    imul edx, dword ptr [edi+8h]     // multiply: pixel3 * w[0][2]
    add  eax, edx                    // accumulate the result of the multiplication

Optimised:

    xor  edx, edx                    // start of calculation for pixel 3: clear
    imul ebx, dword ptr [edi+4]      // pixel 2: ebx = pel2 * w[0][1]
    mov  dl, byte ptr [ecx+2]        // pixel 3: dl = pel3
    add  eax, ebx                    // end of calculation for pixel 2: eax += ebx
    imul edx, dword ptr [edi+8]      // pixel 3: edx = pel3 * w[0][2]
    xor  ebx, ebx                    // start of calculation for pixel 4: clear
Superscalar (continued)
[Figure: number of instructions executed in a clock cycle]
DSP TMS320C80 Parallel Processor (VLIW)
Execution units:
1. Data Unit (DU).
2. Address Unit (AU), local and global.
3. Program Flow Control Unit (PFCU).
Convolution operations:
1. multiply (executed by the DU)
2. accumulate (executed by the DU)
3. load the coefficient (executed by the AU)
4. increment the coefficient pointer (executed by the AU)
5. load the input pixel (executed by the AU)
6. increment the input pixel pointer (executed by the AU)
7. control the loop (executed by the PFCU)
VLIW processor
Crusoe processor, compatible with Pentium
Itanium microprocessor
– Greater number of registers (R0-R127 are general purpose registers).
– VLIW-like architecture: each 128-bit bundle contains 3 instructions, which lets the processor dispatch instructions with simple instruction decoding.
– Stops define which instructions can be executed in parallel, which simplifies grouping instructions for parallel execution.
– Control and data speculation (included at compile time).
Simultaneous Multithreading
[Figure: time-switched multithreading compared with simultaneous multithreading]
Disadvantages:
– Larger cache size
– Operating system support
Single Instruction stream Multiple Data stream (SIMD)
MMX coprocessor
[Figures: Multiply; Multiply & accumulate]
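The "multiply & accumulate" operation can be illustrated with a portable scalar model. This sketch assumes the slide refers to the MMX PMADDWD instruction, which multiplies four pairs of signed 16-bit values and adds adjacent products into two 32-bit results. The function names `pmaddwd_model` and `row3_dot` are hypothetical, and the code models the arithmetic only; it does not use MMX registers or intrinsics.

```c
#include <stdint.h>

/* Scalar model of PMADDWD on one 64-bit operand pair:
 * four signed 16-bit lanes are multiplied pairwise and adjacent
 * products are summed into two signed 32-bit results.
 * The single wrap-around corner case of the real instruction
 * (all four values of one pair equal to -32768) is not modelled. */
void pmaddwd_model(const int16_t a[4], const int16_t w[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * w[0] + (int32_t)a[1] * w[1];
    out[1] = (int32_t)a[2] * w[2] + (int32_t)a[3] * w[3];
}

/* One row of a 3-tap kernel computed with a single modelled PMADDWD:
 * the pixels are widened to 16 bits and the unused fourth lane is zero. */
int32_t row3_dot(const uint8_t *pa, const int16_t w[3])
{
    int16_t a4[4] = { (int16_t)pa[0], (int16_t)pa[1], (int16_t)pa[2], 0 };
    int16_t w4[4] = { w[0], w[1], w[2], 0 };
    int32_t part[2];

    pmaddwd_model(a4, w4, part);
    return part[0] + part[1];      /* final horizontal add of the two halves */
}
```

One such operation replaces three multiplications and one addition of the scalar loop, but the 8-bit pixels must first be unpacked to 16 bits and the two partial sums still have to be added, so the gain per row is smaller than the lane count alone suggests.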
SIMD
[Figure: relative calculation time for standard and MMX processors, $t_{MMX}/t_{stand}$]
Different processor speeds
[Figure: calculation time [ms] for a 512 × 512 image and a 3 × 3 convolution kernel (integer unit)]
Comparison of different microprocessors & DSPs
[Figure: time [ms]]
Dedicated VLSI Processors
FPGAs
Similar design scheme as for ASICs, but:
– Quick time-to-market (simpler designing and testing)
– Flexible design and dynamic reprogramming
Available resources:
– Memory blocks (for line buffers)
– Dedicated carry logic (for arithmetic units)
– Built-in 18x18 multipliers (Virtex II)
– Design automation tools and cores
Conclusions: suggestions for improving microprocessor performance
– Improve instruction decoding and dispatching (by including compile-time information, VLIW-like architecture)
– Introduce Simultaneous Multithreading
– Enlarge the data format in SIMD
Will microprocessor speed grow?
Clock frequency doubles every five years...
...but the speed of light never changes (Moore meets Einstein?)
Saturated architecture of the microprocessors:
– pipelining
– instruction level parallelism
– branch prediction & speculative execution
– compile-time information
Solution?
– Microprocessor
– MMX (SIMD) coprocessor
– FPGA-like coprocessor
Thank you for your attention
The rest of the image is not shown because of insufficient computation power