Implementation of the Convolution Operation on General Purpose Processors Ernest Jamro AGH Technical University Kraków, Poland.



E. Jamro, K. Wiatr; Implementation of Convolution Operation on General Purpose Processors

Contents
1. Introduction.
2. Pentium and Itanium processors:
   - pipelining
   - superscalar
   - Very Long Instruction Word (VLIW)
   - Single Instruction Multiple Data (SIMD).
3. Convolvers implemented in FPGAs.
4. Conclusions.

Mathematical operation
[Equation: convolution formula, an image on the original slide]
where:
- N, M: the size of the convolution kernel (usually odd numbers),
- a_{y,x}: an input,
- b_{y,x}: an output,
- w_{i,j}: a coefficient of the convolution,
- D: a common denominator, D = 2^n.
For a 512 × 512 image at 25 frames/s and a 3 × 3 convolution kernel:
L_M = N_X × N_Y × N_F × N × M = 58,982,400 multiplies/s
L_A = N_X × N_Y × N_F × (N × M - 1) = 52,428,800 additions/s
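The operation counts L_M and L_A can be checked with a few lines of C. This is only a sketch: the function names `multiplies_per_second` and `additions_per_second` are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Multiplications per second: one per kernel coefficient for every output
 * pixel of every frame (nx * ny pixels, nf frames/s, n * m coefficients). */
long long multiplies_per_second(long long nx, long long ny, long long nf,
                                long long n, long long m)
{
    assert(n > 0 && m > 0);
    return nx * ny * nf * n * m;
}

/* Additions per second: summing n * m products takes n * m - 1 additions
 * per output pixel. */
long long additions_per_second(long long nx, long long ny, long long nf,
                               long long n, long long m)
{
    assert(n > 0 && m > 0);
    return nx * ny * nf * (n * m - 1);
}
```

Calling both functions with (512, 512, 25, 3, 3) gives the load for the 512 × 512 image at 25 frames/s with a 3 × 3 kernel.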

C-language program
For every output pixel:

    int sum = D/2;                    // accumulation result, initially D/2 to minimise the division rounding error
    for (int i = 0; i < N; i++)       // vertical convolution
    {
        for (int j = 0; j < M; j++)   // horizontal convolution
            sum += *pw++ * *pa++;     // the kernel of the convolution
        pa += Nx - M;                 // pa will point to the first pixel in the next line
    }
    sum /= D;                         // division by the common denominator
    *pb = (BYTE) sum;                 // conversion from int (4 bytes) to a 1-byte variable; save the result
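The per-pixel fragment can be wrapped into a function that processes a whole image. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `convolve`, the choice to write each result at the centre of its window, leaving border pixels untouched, and clamping the result to 0..255 are all assumptions added here. The input pointer is re-derived for every kernel row instead of being advanced by `Nx - M`, so it never moves past the end of the image.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Convolve the nx-by-ny image a with the n-by-m kernel w (stored row by row)
 * and write the results into b. d is the common denominator. */
void convolve(const BYTE *a, BYTE *b, int nx, int ny,
              const int *w, int n, int m, int d)
{
    assert(n > 0 && m > 0 && d > 0 && nx >= m && ny >= n);
    for (int y = 0; y + n <= ny; y++) {
        for (int x = 0; x + m <= nx; x++) {
            const int *pw = w;
            int sum = d / 2;              /* bias to reduce the rounding error */
            for (int i = 0; i < n; i++) {            /* vertical convolution */
                /* first window pixel in kernel row i */
                const BYTE *pa = a + (size_t)(y + i) * nx + x;
                for (int j = 0; j < m; j++)          /* horizontal convolution */
                    sum += *pw++ * *pa++;
            }
            sum /= d;                     /* division by the common denominator */
            if (sum < 0)   sum = 0;       /* clamping: an assumption, not on the slide */
            if (sum > 255) sum = 255;
            /* assumption: the result belongs to the centre of the window */
            b[(size_t)(y + n / 2) * nx + (x + m / 2)] = (BYTE)sum;
        }
    }
}
```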

Implementation on different Pentium processors
Pipelining in Pentium 75 MHz
[Figure: pipeline diagram with the labels Branch, Instruction M+3, Instruction M+4]

Loop unrolling

    sum = D/2;                // initialisation
    sum += *pa++ * *pw++;
    sum += *pa * *pw++;       // last pixel in the line: pa is not incremented
    pa += Nx - M + 1;         // go to the next line (pa was advanced M-1 times in this line)
    sum += *pa++ * *pw++;
    ...
    sum /= D;
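A complete unrolled version for one output pixel of a 3 × 3 kernel might look as follows. It is a sketch only: the names `conv3x3_unrolled` and `conv3x3_loop` are illustrative, and constant offsets are used in place of the pointer increments on the slide, which leaves the loop-free structure unchanged. The loop version is included so that the two results can be compared.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Reference version with both loops kept. pa points to the top-left pixel
 * of the window, nx is the line length, w holds 9 coefficients row by row. */
int conv3x3_loop(const BYTE *pa, int nx, const int *w, int d)
{
    assert(d > 0);
    int sum = d / 2;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            sum += pa[i * nx + j] * w[i * 3 + j];
    return sum / d;
}

/* Fully unrolled version: no loop counters and no branches. */
int conv3x3_unrolled(const BYTE *pa, int nx, const int *w, int d)
{
    assert(d > 0);
    int sum = d / 2;          /* initialisation */
    sum += pa[0] * w[0];
    sum += pa[1] * w[1];
    sum += pa[2] * w[2];
    pa += nx;                 /* go to the next line */
    sum += pa[0] * w[3];
    sum += pa[1] * w[4];
    sum += pa[2] * w[5];
    pa += nx;                 /* go to the next line */
    sum += pa[0] * w[6];
    sum += pa[1] * w[7];
    sum += pa[2] * w[8];
    return sum / d;           /* division by the common denominator */
}
```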

Loop unrolling
[Figure: relative calculation time after loop unrolling, t_LU / t_stand]

DSP solution, e.g. Texas Instruments TMS320C80
Hardware loop control
Special registers:
- Loop Counter: number of branches to the start of the loop
- Loop End: points to the last instruction in the loop
- Loop Start: points to the first instruction in the loop
- Loop Reload

Superscalar architecture
Instruction Level Parallelism

Non-optimised:

    // pixel 3 (top-right pixel in the convolution window)
    xor  edx, edx                   // clear edx
    mov  dl, byte ptr [ecx+2]       // load data a (pixel 3)
    imul edx, dword ptr [edi+8h]    // multiply: pixel3 * w[0][2]
    add  eax, edx                   // accumulate the result of the multiplication

Optimised:

    xor  edx, edx                   // start of calculation for pixel 3: clear
    imul ebx, dword ptr [edi+4]     // pixel 2: ebx = pel2 * w[0][1]
    mov  dl, byte ptr [ecx+2]       // pixel 3: dl = pel3
    add  eax, ebx                   // end of calculation for pixel 2: eax += ebx
    imul edx, dword ptr [edi+8]     // pixel 3: edx = pel3 * w[0][2]
    xor  ebx, ebx                   // start of calculation for pixel 4: clear
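A related source-level technique is to keep two independent accumulators, so that consecutive multiply-accumulate operations do not depend on each other. This is not the hand-scheduled assembly shown on the slide, only a C sketch of the same idea of exposing independent work. The name `dot_two_accumulators` is illustrative, and whether a given compiler and processor actually benefit has to be measured.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Sum of a[k] * w[k] for k = 0..len-1, computed with two accumulators.
 * The two statements in the loop body have no data dependency on each
 * other, so a superscalar processor may execute them in parallel. */
int dot_two_accumulators(const BYTE *a, const int *w, int len)
{
    assert(len >= 0);
    int sum0 = 0;
    int sum1 = 0;
    int k = 0;
    for (; k + 1 < len; k += 2) {
        sum0 += a[k] * w[k];          /* even-indexed products */
        sum1 += a[k + 1] * w[k + 1];  /* odd-indexed products */
    }
    if (k < len)                      /* leftover element when len is odd */
        sum0 += a[k] * w[k];
    return sum0 + sum1;
}
```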

Superscalar (continued)
[Figure: number of instructions executed in a clock cycle]

DSP TMS320C80 Parallel Processor (VLIW)
Execution units:
1. Data Unit (DU).
2. Address Unit (AU), local and global.
3. Program Flow Control Unit (PFCU).
Convolution operations:
1. multiply (executed by the DU)
2. accumulate (executed by the DU)
3. load the coefficient (executed by the AU)
4. increment the coefficient pointer (executed by the AU)
5. load the input pixel (executed by the AU)
6. increment the input pixel pointer (executed by the AU)
7. control the loop (executed by the PFCU)

VLIW processor
Crusoe processor, compatible with Pentium

Itanium microprocessor
- Greater number of registers (R0-R127: general purpose registers).
- VLIW-like architecture: each 128-bit bundle contains 3 instructions, which enables the processor to dispatch instructions with simple instruction decoding.
- Stops define which instructions can be executed in parallel, which simplifies grouping instructions for parallel execution.
- Control and data speculation (included at compile time).

Simultaneous Multithreading
[Figure: time-switched multithreading compared with simultaneous multithreading]
Disadvantages:
- Larger cache size
- Operating system support

Single Instruction stream, Multiple Data stream (SIMD)
MMX coprocessor
[Figure: MMX packed operations: multiply; multiply & accumulate]
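The multiply & accumulate operation can be modelled in plain C. The sketch below follows the behaviour of the MMX PMADDWD instruction on one 64-bit operand: four signed 16-bit multiplications, with adjacent products added into two 32-bit results. The function name `packed_madd16` is illustrative. It is a scalar model that shows what a single instruction computes, not SIMD code.

```c
#include <stdint.h>

/* Scalar model of a packed multiply-and-add on four signed 16-bit lanes:
 *   out[0] = a[0]*b[0] + a[1]*b[1]
 *   out[1] = a[2]*b[2] + a[3]*b[3]
 * Each sum is computed in 64 bits and then truncated to 32 bits. */
void packed_madd16(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    for (int lane = 0; lane < 2; lane++) {
        int64_t p0 = (int64_t)a[2 * lane]     * b[2 * lane];
        int64_t p1 = (int64_t)a[2 * lane + 1] * b[2 * lane + 1];
        out[lane] = (int32_t)(uint32_t)(p0 + p1);
    }
}
```

With pixels in `a` and kernel coefficients in `b`, one such operation replaces four multiplications and two additions of the scalar inner loop.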

SIMD
[Figure: relative calculation time for the standard and the MMX processor, t_MMX / t_stand]

Different processor speeds
[Figure: calculation time [ms] for a 512 × 512 image and a 3 × 3 convolution kernel (integer unit)]

Comparison of different microprocessors & DSPs
[Figure: time [ms]]

Dedicated VLSI Processors

FPGAs
Similar design scheme as for ASICs, but:
- Quick time-to-market (simpler designing and testing)
- Flexible design & dynamic reprogramming
Available resources:
- Memory blocks (for line buffers)
- Dedicated carry logic (for arithmetic units)
- Built-in 18x18 multipliers (Virtex II)
Design automation tools and cores

Conclusions: suggestions for improving microprocessor performance
- Improve instruction decoding and dispatching (by including compile-time information, VLIW-like architecture)
- Introduce Simultaneous Multithreading
- Enlarge the data format in SIMD

Will microprocessor speed grow?
- Clock frequency doubles every five years... but the speed of light never changes (Moore meets Einstein?)
- Saturated architecture of the microprocessors:
  - pipelining
  - instruction level parallelism
  - branch prediction & speculative execution
  - compile-time information

Solution?
- Microprocessor
- MMX (SIMD) coprocessor
- FPGA-like coprocessor

Thank you for your attention
The rest of the image is not shown because of insufficient computation power