Lecture 8: Caffe - CPU Optimization boris.ginzburg@intel.com
Agenda: profiling applications with VTune; Caffe with BLAS; parallelization with OpenMP.
VTune: getting started
VTune: getting started
Get a non-commercial license and install Intel Parallel Studio XE 2013: https://software.intel.com/en-us/non-commercial-software-development (includes the Intel C++ Compiler (icc), VTune, and MKL).
Build the application as usual: full optimization, but with debug information (-O2 -g).
Run amplxe-gui, create a VTune project, and run a basic Hotspot analysis.
Analyze the performance results.
Exercise: analyze Caffe training using the MNIST example.
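The same analysis can also be run from the command line with amplxe-cl, VTune's command-line driver. The sketch below is only an illustration: the Caffe binary and solver paths assume a standard Caffe checkout built in ./build, and the stock MNIST solver defaults to GPU mode, so set solver_mode: CPU in it if you want to profile the CPU path.
  amplxe-cl -collect hotspots -result-dir r001hs -- ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt
  amplxe-cl -report hotspots -result-dir r001hs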
Caffe with BLAS
Caffe with BLAS
Caffe's math routines are built on BLAS (Basic Linear Algebra Subroutines), a set of low-level kernel subroutines for linear algebra:
BLAS1: vector-vector operations (e.g. dot product);
BLAS2: matrix-vector operations (e.g. matrix-vector multiply);
BLAS3: matrix-matrix operations (e.g. matrix-matrix multiply).
CPU implementations:
OpenBLAS: http://www.openblas.net/ - very good open-source library
Intel MKL: https://software.intel.com/en-us/intel-mkl - even better, but closed source and requires a license
ATLAS: http://math-atlas.sourceforge.net/ - slow
GPU implementation: cuBLAS (part of the CUDA toolkit): https://developer.nvidia.com/cublas
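To make the BLAS levels concrete, here is a minimal C sketch using the standard cblas interface that these libraries expose (header name and link flags vary by library; with OpenBLAS it is typically gcc blas_demo.c -lopenblas; the sizes and values below are arbitrary illustration):

  #include <stdio.h>
  #include <cblas.h>

  int main(void) {
    /* BLAS1: vector-vector, dot product <x, y> */
    float x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    float d = cblas_sdot(3, x, 1, y, 1);

    /* BLAS3: matrix-matrix multiply, C = alpha*A*B + beta*C
       A is 2x3, B is 3x2, C is 2x2, all stored row-major */
    float A[6] = {1, 2, 3, 4, 5, 6};
    float B[6] = {1, 0, 0, 1, 1, 1};
    float C[4] = {0, 0, 0, 0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,        /* M, N, K       */
                1.0f, A, 3,     /* alpha, A, lda */
                B, 2,           /* B, ldb        */
                0.0f, C, 2);    /* beta, C, ldc  */

    printf("<x,y> = %f, C[0][0] = %f\n", d, C[0]);
    return 0;
  }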
BLAS: Foundation for Math Computing
BLAS is used as a building block in higher-level math libraries such as LINPACK, MKL, or PLASMA.
[Stack diagram: computer vision / machine learning / deep learning applications sit on top of a parallel linear algebra package, which in turn is built on the basic linear algebra subroutines BLAS-1, BLAS-2, and BLAS-3.]
Exercise
Switch between ATLAS, OpenBLAS, and MKL in Makefile.config and compare performance on CIFAR-10 (see the sketch below).
Download the latest version of OpenBLAS, build it, and compare performance again.
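For reference, the BLAS backend is selected in Caffe's Makefile.config with lines like the following (a sketch; the include/lib paths are examples and depend on where each library is installed on your machine):
  # BLAS choice: atlas, mkl, or open (OpenBLAS)
  BLAS := open
  # If the library lives in a non-standard location, point Caffe at it:
  # BLAS_INCLUDE := /opt/OpenBLAS/include
  # BLAS_LIB := /opt/OpenBLAS/lib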
OpenMP
OpenMP: an easy, portable, and scalable way to parallelize applications for many cores.
Multi-threaded, shared-memory model (like pthreads).
A standard API: omp pragmas are supported by the major C/C++ and Fortran compilers (gcc, icc, etc.).
A lot of good tutorials online:
https://computing.llnl.gov/tutorials/openMP/
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
OpenMP programming model: fork-join parallelism. The master thread runs serially until it reaches a parallel region, forks a team of threads to execute that region, and the threads join back into the master thread at the end of the region (see the sketch below).
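A minimal sketch of the fork-join model (assumes a compiler with OpenMP support; thread count and output order will vary):

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
    printf("serial part: 1 thread\n");       /* master thread only */

    #pragma omp parallel                     /* fork: team of threads */
    {
      printf("hello from thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
    }                                        /* join: back to master */

    printf("serial part again: 1 thread\n");
    return 0;
  }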
Example 1: serial vector add
#define N 100
int main (int argc, char *argv[]) {
  int i;
  float a[N], b[N], c[N];
  for (i = 0; i < N; i++) { a[i] = b[i] = 1.0; }
  for (i = 0; i < N; i++) { c[i] = a[i] + b[i]; }
  return 0;
}
Example 1: parallelizing the first loop with an omp pragma
#include <omp.h>
#define N 100
int main (int argc, char *argv[]) {
  int i;
  float a[N], b[N], c[N];
  #pragma omp parallel for
  for (i = 0; i < N; i++) { a[i] = b[i] = 1.0; }
  for (i = 0; i < N; i++) { c[i] = a[i] + b[i]; }
  return 0;
}
Example 1: full program (part 1)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 100
int main (int argc, char *argv[]) {
  int nthreads, tid, i;
  float a[N], b[N], c[N];
  /* outside a parallel region this reports 1; use omp_get_max_threads() for the configured team size */
  nthreads = omp_get_num_threads();
  printf("Number of threads = %d\n", nthreads);
  #pragma omp parallel for
  for (i = 0; i < N; i++) { a[i] = b[i] = 1.0; }
Example 1: full program (part 2)
  /* tid must be private, otherwise all threads race on the same variable */
  #pragma omp parallel for private(tid)
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    tid = omp_get_thread_num();
    printf("Thread %d: c[%d] = %f\n", tid, i, c[i]);
  }
  return 0;
}
Compiling, linking, etc.
Add the OpenMP flag when compiling: -fopenmp for gcc, -openmp for icc:
gcc -fopenmp omp_vecadd.c -o vecadd
icc -openmp omp_vecadd.c -o vecadd
Control the number of threads through the OMP_NUM_THREADS environment variable:
setenv OMP_NUM_THREADS 8     (csh)
export OMP_NUM_THREADS=8     (bash)
Exercise
Implement with OpenMP:
vector dot product: c = <x, y> (see the sketch below)
matrix-matrix multiply
2D matrix convolution
Add OpenMP support to the ReLU and max-pooling layers in Caffe.
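A minimal sketch for the dot-product part of the exercise (C with OpenMP; the reduction clause gives each thread a private partial sum and combines them at the join, which is the key point the other exercises also build on):

  #include <omp.h>
  #include <stdio.h>
  #define N 1000

  int main(void) {
    float x[N], y[N], c = 0.0f;
    int i;
    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* each thread accumulates a private partial sum; OpenMP adds them together at the end */
    #pragma omp parallel for reduction(+:c)
    for (i = 0; i < N; i++) {
      c += x[i] * y[i];
    }

    printf("c = <x,y> = %f\n", c);   /* expected 2000.0 for this data */
    return 0;
  }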