Copyright © 2007 Intel Corporation. 16bit 3D Convolution Implementation: SSE + OpenMP, Benchmarking on Penryn. Dr. Zvi Danovich, Senior Application Engineer. January 2008

Agenda
- Mathematics of 3D convolution
- Main idea of SSE implementation of 1D convolution
- Basic routine of algorithm: 2D convolution – 1 line
- Main routine of algorithm: 3D convolution – line by line
- Adding OpenMP, benchmarking, conclusions

3D convolution – what is it?

The 3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as

$P(x,y,z) = \sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{k=-1}^{1} K(i,j,k)\; p(x+i,\,y+j,\,z+k)$

where p denotes the source pixels and K the convolution kernel values. In other words, each new pixel is the sum of 27 products of source-pixel values with the corresponding kernel values inside the kernel cube. [figure: the 27 K·p products around the central pixel summed into P]
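As a concrete reference for this definition, below is a minimal scalar C++ sketch of the 27-term sum. The buffer layout (a contiguous stack of width x height 16-bit slices), the function name and the normalization shift parameter are assumptions for illustration, not taken from the deck.

```cpp
#include <cstdint>
#include <cstddef>

// Scalar reference: 3D convolution with a 3x3x3 kernel.
// Assumed (hypothetical) layout: src/dst are stacks of 'depth' slices,
// each slice is 'height' rows of 'width' int16_t pixels, row-major.
// 'shift' is a normalization arithmetic shift (assumption).
void convolve3d_scalar(const int16_t* src, int16_t* dst,
                       const int32_t kernel[3][3][3],
                       size_t width, size_t height, size_t depth, int shift)
{
    const size_t slice = width * height;
    for (size_t z = 1; z + 1 < depth; ++z)
        for (size_t y = 1; y + 1 < height; ++y)
            for (size_t x = 1; x + 1 < width; ++x) {
                int32_t sum = 0;
                for (int dz = -1; dz <= 1; ++dz)          // 3 slices
                    for (int dy = -1; dy <= 1; ++dy)      // 3 lines per slice
                        for (int dx = -1; dx <= 1; ++dx)  // 3 pixels per line: 27 products
                            sum += kernel[dz + 1][dy + 1][dx + 1] *
                                   src[(z + dz) * slice + (y + dy) * width + (x + dx)];
                dst[z * slice + y * width + x] = (int16_t)(sum >> shift);
            }
}
```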

Recombination from 1D convolutions

If the 1D convolution along a line (at fixed y, z) is defined as

$C_{j,k}(x) = \sum_{i=-1}^{1} K(i,j,k)\; p(x+i,\,y+j,\,z+k)$

then the final line of the 3D convolution is

$P(x,y,z) = \sum_{k=-1}^{1}\sum_{j=-1}^{1} C_{j,k}(x)$

i.e. the 3D convolution can be presented as a double sum of nine 1D convolutions – 3 planes, with 3 lines in each plane.
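A scalar sketch of this recombination, under the same assumed layout as before and with hypothetical helper names: one output line is accumulated from nine 1D line convolutions – three lines from each of the three neighbouring slices.

```cpp
#include <cstdint>
#include <cstddef>
#include <algorithm>

// 1D convolution of one source line with one kernel row (k-, kc, k+),
// accumulated into a 32-bit line. Hypothetical helper name.
static void conv1d_accumulate(const int16_t* line, const int32_t k[3],
                              int32_t* acc, size_t width)
{
    for (size_t x = 1; x + 1 < width; ++x)
        acc[x] += k[0] * line[x - 1] + k[1] * line[x] + k[2] * line[x + 1];
}

// One 32-bit output line of the 3D convolution as a double sum of nine
// 1D convolutions: 3 slices (dz) times 3 lines (dy). For interior lines
// (1 <= y < height-1, 1 <= z < depth-1).
void conv3d_line_from_1d(const int16_t* src, int32_t* out_line,
                         const int32_t kernel[3][3][3],
                         size_t width, size_t height, size_t y, size_t z)
{
    const size_t slice = width * height;
    std::fill(out_line, out_line + width, 0);
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            conv1d_accumulate(src + (z + dz) * slice + (y + dy) * width,
                              kernel[dz + 1][dy + 1], out_line, width);
}
```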


Main part of algorithm: 1D convolution – idea of implementation

Start from 3 sequential QUADs of the source line and multiply all three by the different kernel values, denoted k-, kc and k+ (each value broadcast across a QUAD). [figure: the source pixels p0..p7 multiplied by k-, kc and k+, giving three product lines]

Using PALIGNR, select the QUAD shifted left for the products with k- and the QUAD shifted right for the products with k+, and sum them with the unshifted kc product QUAD. For the central QUAD p0..p3 this gives

$P_i = k^-\,p_{i-1} + k_c\,p_i + k^+\,p_{i+1}, \quad i = 0,\dots,3$

i.e. the resulting sums are exactly the 1D-convolution expressions for the central QUAD. A hedged intrinsics sketch of this step follows.
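A minimal SSE sketch of this idea, assuming the source pixels are already widened to 32-bit QUADs and that km/kc/kp hold k-, kc, k+ splatted across all four lanes (names are hypothetical). For brevity it aligns the source QUADs with PALIGNR and then multiplies, whereas the slides align the already-multiplied product QUADs – the result is identical because the kernel value is the same in every lane.

```cpp
#include <smmintrin.h>   // SSSE3 _mm_alignr_epi8 + SSE4.1 _mm_mullo_epi32

// One 1D-convolution step for the central QUAD.
// prev, cur, next hold 3 sequential QUADs of 32-bit source pixels
// (p[-4..-1], p[0..3], p[4..7]); km, kc, kp are k-, kc, k+ in all lanes.
// Lane i of the result is k-*p[i-1] + kc*p[i] + k+*p[i+1], i = 0..3.
static inline __m128i conv1d_quad(__m128i prev, __m128i cur, __m128i next,
                                  __m128i km, __m128i kc, __m128i kp)
{
    __m128i left  = _mm_alignr_epi8(cur, prev, 12); // QUAD shifted left:  p[-1..2]
    __m128i right = _mm_alignr_epi8(next, cur, 4);  // QUAD shifted right: p[1..4]
    __m128i sum   = _mm_mullo_epi32(cur, kc);       // kc * p[0..3]
    sum = _mm_add_epi32(sum, _mm_mullo_epi32(left,  km));
    sum = _mm_add_epi32(sum, _mm_mullo_epi32(right, kp));
    return sum;
}
```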


Basic routine of algorithm: 2D convolution – 1 line

The main loop processes sequential EIGHTs of 16-bit pixels for 3 adjacent lines (unrolled inside one step). The 1D convolution (in 32-bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, thereby forming the 2D-convolution results.

To avoid "if"s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.

Below is the description of the computation for 1 line (out of the 3) in a general main-loop step. It starts by loading an EIGHT of 16-bit source pixels and unpacking it into two 32-bit QUADs [figure: the EIGHT p0..p7 is shuffled into a first unpacked 32-bit QUAD p0..p3 and a second unpacked 32-bit QUAD p4..p7]; a sketch of this load/unpack follows.
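A sketch of this load/unpack, assuming unsigned 16-bit pixels and SSE4.1; the deck only says the unpack is done with shuffles, so the exact instruction choice here is an assumption.

```cpp
#include <smmintrin.h>  // SSE4.1
#include <cstdint>

// Load an EIGHT of 16-bit source pixels and unpack it into two 32-bit QUADs.
// Assumes unsigned 16-bit pixel data (zero extension); for signed data
// _mm_cvtepi16_epi32 would be used instead.
static inline void load_eight_as_two_quads(const uint16_t* p,
                                           __m128i* lo_quad, __m128i* hi_quad)
{
    __m128i eight = _mm_loadu_si128((const __m128i*)p);      // p0..p7, 16-bit
    *lo_quad = _mm_cvtepu16_epi32(eight);                     // p0..p3 as 32-bit
    *hi_quad = _mm_cvtepu16_epi32(_mm_srli_si128(eight, 8));  // p4..p7 as 32-bit
}
```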

Basic routine of algorithm: 2D convolution – 1 line (continued)

Multiply the 2 unpacked QUADs (from the previous foil) by the three kernel values k-, kc and k+ (SSE4 mullo_epi32), producing 6 product QUADs, and treat them together with the 2 similar product QUADs saved at the previous step.

Using PALIGNR, select the appropriate QUADs and start/continue forming 3 sum QUADs:
- (1) RED frame: 2D convolution of the 1st source QUAD – finalized and stored at the end of the current step;
- (2) GREEN frame: 2D convolution of the 2nd source QUAD – finalized and stored at the end of the next step (or in the epilog);
- (Prev) YELLOW frame: 2D convolution of the previous 2nd source QUAD – finalized and stored at the end of the current step.

Therefore, at the end of the current step, two resulting 2D-convolution QUADs – the PREVIOUS 2nd and the CURRENT 1st – are stored. One line's contribution to these three sums is sketched below.
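A hedged sketch of one line's contribution within one main-loop step. It carries the deferred work as a partial-sum QUAD rather than as the two raw product QUADs the slides save, which is an equivalent but not identical organization; all names are hypothetical.

```cpp
#include <smmintrin.h>  // SSSE3 palignr + SSE4.1 mullo_epi32

// One line's contribution in one main-loop step; called once per line
// (3 times per step) with that line's km/kc/kp (kernel values in all lanes).
// prev_q1 = last 32-bit QUAD of the previous EIGHT (p[-4..-1]);
// quad0/quad1 = the current EIGHT as 32-bit QUADs (p[0..3], p[4..7]).
// acc_prev: partial sum of the previous 2nd QUAD (YELLOW), completed here.
// acc_cur : sum for the current 1st QUAD (RED), finalized this step.
// acc_next: partial sum for the current 2nd QUAD (GREEN), completed next step.
static inline void conv_line_step(__m128i prev_q1, __m128i quad0, __m128i quad1,
                                  __m128i km, __m128i kc, __m128i kp,
                                  __m128i* acc_prev, __m128i* acc_cur, __m128i* acc_next)
{
    // Finish the previous 2nd QUAD: add k+ * [p-3..p0]
    __m128i right_prev = _mm_alignr_epi8(quad0, prev_q1, 4);
    *acc_prev = _mm_add_epi32(*acc_prev, _mm_mullo_epi32(right_prev, kp));

    // Current 1st QUAD: k- * [p-1..p2] + kc * [p0..p3] + k+ * [p1..p4]
    __m128i left  = _mm_alignr_epi8(quad0, prev_q1, 12);
    __m128i right = _mm_alignr_epi8(quad1, quad0, 4);
    *acc_cur = _mm_add_epi32(*acc_cur,
               _mm_add_epi32(_mm_mullo_epi32(left, km),
               _mm_add_epi32(_mm_mullo_epi32(quad0, kc),
                             _mm_mullo_epi32(right, kp))));

    // Current 2nd QUAD: k- * [p3..p6] + kc * [p4..p7]; the k+ part waits for p8
    __m128i left1 = _mm_alignr_epi8(quad1, quad0, 12);
    *acc_next = _mm_add_epi32(*acc_next,
                _mm_add_epi32(_mm_mullo_epi32(left1, km),
                              _mm_mullo_epi32(quad1, kc)));
}
```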

Basic routine of algorithm: 2D convolution – 1 line, finalizing

As already mentioned, each step treats and sums up data from 3 adjacent lines – it performs the computations of the previous foils for the 2 other lines as well, with the corresponding sets of kernel components.

The prolog step does not include the PREVIOUS sum computation and does not save it. The epilog step includes the computation and store of the very last 2D-convolution QUAD, fully analogous to the PREVIOUS computation in a regular step.

Finally, the routine above builds ONE 32-bit line of 2D-convolution result points.


Main routine of algorithm: 3D convolution – line by line

To build the full 3D-convolution stack, this routine runs over the lines (inner loop) of all slices (external loop). For each source line it computes three 32-bit 2D-convolution lines – based on the previous, current and next slices – using the "2D convolution – 1 line" routine described above. [figure: Line -1, Line 0 and Line +1 of the previous, current and next slices feeding the 2D convolution and the summing-up stage]

The resulting 3D-convolution line is built by summing up these 3 lines, normalizing by an arithmetic shift and converting the result to 16 bit: the three 32-bit 2D-convolution lines are added into a 32-bit 3D-convolution EIGHT, shifted (after the shift the values actually fit in 16 bits), packed with packs_epi32 and stored as the final 16-bit 3D-convolution EIGHT, as sketched below.
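A sketch of this finalization for one EIGHT, assuming the three 32-bit 2D-convolution lines sit in separate buffers and using a hypothetical normalization shift value.

```cpp
#include <emmintrin.h>  // SSE2: add, srai, packs, store
#include <cstdint>
#include <cstddef>

// Hypothetical normalization shift; the deck only says "arithmetical shift".
constexpr int kNormShift = 8;

// Combine one EIGHT of the three 32-bit 2D-convolution lines (previous,
// current and next slice) into the final 16-bit 3D-convolution EIGHT:
// sum, arithmetic shift, pack with saturation (packs_epi32), store.
static inline void finalize_eight(const int32_t* line_m1, const int32_t* line_0,
                                  const int32_t* line_p1, int16_t* dst, size_t x)
{
    __m128i lo = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(line_m1 + x)),
                 _mm_add_epi32(_mm_loadu_si128((const __m128i*)(line_0  + x)),
                               _mm_loadu_si128((const __m128i*)(line_p1 + x))));
    __m128i hi = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(line_m1 + x + 4)),
                 _mm_add_epi32(_mm_loadu_si128((const __m128i*)(line_0  + x + 4)),
                               _mm_loadu_si128((const __m128i*)(line_p1 + x + 4))));
    lo = _mm_srai_epi32(lo, kNormShift);               // normalize (arithmetic shift)
    hi = _mm_srai_epi32(hi, kNormShift);
    _mm_storeu_si128((__m128i*)(dst + x), _mm_packs_epi32(lo, hi)); // 16-bit EIGHT
}
```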


Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP over the external (slices) loop, 3 32-bit working lines are allocated for each thread. Below are benchmarks with and without OpenMP on a 2-way HPTN (Harpertown) machine (8 cores).

3 runs – equivalent of a 3D gradient computation (SSE only vs. SSE+OpenMP): Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~5.5, Serial/(SSE+OpenMP) = ~16.3.

A further set of runs (SSE only vs. SSE+OpenMP): Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~6.3, Serial/(SSE+OpenMP) = ~18.6.

The SSE speed-up (~3x) is close to the theoretical limit for 4-wide 32-bit vector operations. The additional OpenMP speed-up (5.5x–6.3x) brings the overall speed-up to 16.3x–18.6x.
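A minimal sketch of the OpenMP structure described above. The deck allocates three 32-bit working lines per thread for the three 2D-convolution lines; to stay self-contained this sketch uses the scalar slice body and a single per-thread working line – the point is the parallel region with per-thread storage and the `omp for` over slices.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of parallelizing the external (slice) loop with OpenMP.
// Each thread allocates its own working storage inside the parallel region.
void convolve3d_omp(const int16_t* src, int16_t* dst,
                    const int32_t kernel[3][3][3],
                    size_t width, size_t height, size_t depth, int shift)
{
    const size_t slice = width * height;
    #pragma omp parallel
    {
        std::vector<int32_t> work(width);          // per-thread 32-bit working line
        #pragma omp for schedule(static)
        for (long long z = 1; z < (long long)depth - 1; ++z) {
            for (size_t y = 1; y + 1 < height; ++y) {
                for (size_t x = 1; x + 1 < width; ++x) {
                    int32_t sum = 0;
                    for (int dz = -1; dz <= 1; ++dz)
                        for (int dy = -1; dy <= 1; ++dy)
                            for (int dx = -1; dx <= 1; ++dx)
                                sum += kernel[dz + 1][dy + 1][dx + 1] *
                                       src[(size_t)(z + dz) * slice + (y + dy) * width + (x + dx)];
                    work[x] = sum >> shift;        // normalize into the working line
                }
                for (size_t x = 1; x + 1 < width; ++x)
                    dst[(size_t)z * slice + y * width + x] = (int16_t)work[x];
            }
        }
    }
}
```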