Performance Libraries: Intel® Math Kernel Library (MKL) – Intel Software College



Agenda
- Introduction: purpose of the library; Intel® Math Kernel Library (Intel® MKL) contents
- Performance features: resource-limited optimization; threading
- Using the library
- The library sections: BLAS, LAPACK*, DFTs, VML, VSL

Math Kernel Library Purpose
Addresses:
- Solvers (BLAS, LAPACK)
- Eigenvector/eigenvalue solvers (BLAS, LAPACK)
- Some quantum chemistry needs (dgemm)
- PDEs, signal processing, seismic, solid-state physics (FFTs)
- General scientific and financial codes: vector transcendental functions (VML) and vector random number generators (VSL)

Math Kernel Library Purpose – Don'ts
But don't use Intel® Math Kernel Library (Intel® MKL) for everything:
- Don't use Intel® MKL on "small" counts.
- Don't call vector math functions on small n.
(Figure: a geometric transformation, [X' Y' Z' W'] = 4x4 transformation matrix × [X Y Z W].)
For cases like that, you could use Intel® Performance Primitives instead.

Math Kernel Library Contents
BLAS (Basic Linear Algebra Subprograms):
- Level 1 BLAS – vector-vector operations: 15 function types, 48 functions
- Level 2 BLAS – matrix-vector operations: 26 function types, 66 functions
- Level 3 BLAS – matrix-matrix operations: 9 function types, 30 functions
- Extended BLAS – Level 1 BLAS for sparse vectors: 8 function types, 24 functions

Math Kernel Library Contents
LAPACK (Linear Algebra PACKage): solvers and eigensolvers; more than 1000 user-callable and support routines in total.
DFTs (Discrete Fourier Transforms): mixed-radix, multi-dimensional transforms; multithreaded.
VML (Vector Math Library): a set of vectorized transcendental functions – most of the libm functions, but faster.
VSL (Vector Statistical Library): a set of vectorized random number generators.

Math Kernel Library Contents
BLAS and LAPACK* have Fortran interfaces – a legacy of high-performance computing.
VSL and VML have both Fortran and C interfaces.
DFTs have Fortran 95 and C interfaces.
A cblas interface is also provided, which makes it more convenient for a C/C++ programmer to call BLAS.
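
For example, a Level 1 routine such as daxpy (y = alpha*x + y) can be called from C through the cblas interface. A minimal sketch, assuming MKL's mkl_cblas.h header:

/* Sketch: y = alpha*x + y via the cblas interface. */
#include <stdio.h>
#include <mkl_cblas.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 10.0, 10.0, 10.0};
    double alpha = 2.0;

    /* daxpy: y <- alpha*x + y, unit strides */
    cblas_daxpy(4, alpha, x, 1, y, 1);

    for (int i = 0; i < 4; i++)
        printf("%f\n", y[i]);   /* prints 12, 14, 16, 18 */
    return 0;
}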

Math Kernel Library (Intel® MKL) Environment Support
- 32-bit and 64-bit Intel® processors
- Large set of examples and tests
- Extensive documentation

                Windows*                   Linux*
  Compilers     Intel, CVF, Microsoft      Intel, GNU
  Libraries     .dll, .lib                 .a, .so

Resource Limited Optimization
The goal of all optimization is maximum speed. Resource-limited optimization exhausts one or more resources of the system:
- CPU: register use, FP units.
- Cache: keep data in cache as long as possible; deal with cache interleaving.
- TLBs: maximally use the data on each page.
- Memory bandwidth: minimize memory accesses.
- Computer: use all available processors via threading.
- System: use all available nodes (cluster software).

Threading
Most of Intel® Math Kernel Library (Intel® MKL) could be threaded, but the limiting resource is memory bandwidth: threading the Level 1 and Level 2 BLAS is mostly ineffective (O(n) work).
There are numerous opportunities for threading:
- Level 3 BLAS (O(n^3))
- LAPACK* (O(n^3))
- FFTs (O(n log n))
- VML, VSL – depends on the processor and function
All threading is via OpenMP*. All of Intel MKL is designed and compiled for thread safety.
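
Because the threading is via OpenMP*, the usual OpenMP controls determine how many threads MKL uses. A minimal sketch; OMP_NUM_THREADS and omp_set_num_threads() are standard OpenMP controls, while the MKL-specific mkl_set_num_threads() mentioned in the comment is a later addition and is noted here only as an assumption:

/* Sketch: controlling how many threads MKL's OpenMP layer may use.
   Shell alternative: set OMP_NUM_THREADS=4 in the environment before running.
   (Later MKL releases also offer mkl_set_num_threads(); not covered by this slide.) */
#include <omp.h>

void set_mkl_threads(int nthreads)
{
    /* Subsequent threaded MKL calls (Level 3 BLAS, LAPACK, FFTs)
       will use up to this many OpenMP threads. */
    omp_set_num_threads(nthreads);
}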

Linking with Intel® Math Kernel Library (Intel® MKL)
Scenario 1 – ifort, BLAS, IA-32 processor:
    ifort myprog.f mkl_c.lib
Scenario 2 – CVF, LAPACK, IA-32 processor:
    f77 myprog.f mkl_s.lib
Scenario 3 – link a C program against the import library, with the DLL loaded at run time:
    link myprog.obj mkl_c_dll.lib
Note: the optimal binary code path is selected at run time based on the processor.
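
For comparison, a Linux* link line of the same era might look like the following. This is an assumption: the exact library names and directory layout vary by MKL version and architecture, so check the MKL User's Guide for your release:

# Assumed Linux link line for a 32-bit build (library names vary by MKL release):
ifort myprog.f -L<mkl_install_dir>/lib/32 -lmkl -lguide -lpthread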

Matrix Multiplication – Roll Your Own / Dot Product

Roll your own (triple loop), with row-major arrays a[n][kk], b[kk][m], c[n][m]:
    for( i = 0; i < n; i++ )
        for( j = 0; j < m; j++ )
            for( k = 0; k < kk; k++ )
                c[i][j] += a[i][k] * b[k][j];

ddot – each element of C is the dot product of a row of A and a column of B:
    for( i = 0; i < n; i++ )
        for( j = 0; j < m; j++ )
            c[i][j] = cblas_ddot( kk, &a[i][0], 1, &b[0][j], m );

Matrix Multiplication – DGEMV/DGEMM

dgemv – one call per column of C (c[:,i] = alpha*A*b[:,i] + beta*c[:,i]), again assuming row-major a[n][kk], b[kk][m], c[n][m]:
    for( i = 0; i < m; i++ )
        cblas_dgemv( CblasRowMajor, CblasNoTrans, n, kk, alpha,
                     a, kk, &b[0][i], m, beta, &c[0][i], m );

dgemm – one call computes all of C (the column-major call with swapped operands yields the row-major product C = A*B):
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 m, n, kk, alpha, b, m, a, kk, beta, c, m );
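
For reference, here is a self-contained version of the dgemm call written with the row-major cblas layout (a minimal sketch; the sizes and values are arbitrary):

/* Sketch: C = A*B with cblas_dgemm, row-major layout, small fixed sizes. */
#include <stdio.h>
#include <mkl_cblas.h>

int main(void)
{
    enum { N = 2, KK = 3, M = 2 };        /* A is N x KK, B is KK x M, C is N x M */
    double a[N * KK] = {1, 2, 3,
                        4, 5, 6};
    double b[KK * M] = {7,  8,
                        9,  10,
                        11, 12};
    double c[N * M]  = {0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, M, KK, 1.0, a, KK, b, M, 0.0, c, M);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++)
            printf("%6.1f ", c[i * M + j]);
        printf("\n");                     /* expected:  58 64 / 139 154 */
    }
    return 0;
}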

Activity 1: DGEMM
Compare the performance of matrix multiply as implemented by C source code, DDOT, DGEMV, and DGEMM. Exercise control of the threading capabilities in MKL/BLAS.

Intel® Math Kernel Library Optimizations in LAPACK*
Most important LAPACK optimizations:
- Threading – effectively uses multiple CPUs
- Recursive factorization
  - Reduces scalar time (Amdahl's law: t = t_scalar + t_parallel/p)
  - Extends blocking further into the code
  - No runtime library support required
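
As a usage illustration, a dense linear system A*x = b can be solved with LAPACK's dgesv through Intel MKL. A minimal sketch, assuming the Fortran-style prototype from MKL's mkl_lapack.h; the exact routine name (dgesv vs. dgesv_) and integer width can vary by platform and MKL version:

/* Sketch: solve A*x = b with LAPACK dgesv via MKL.
   Arguments are passed by pointer because the interface is Fortran. */
#include <stdio.h>
#include <mkl_lapack.h>

int main(void)
{
    MKL_INT n = 2, nrhs = 1, lda = 2, ldb = 2, info;
    MKL_INT ipiv[2];
    /* A is stored column-major (Fortran order): A = [ 3 1 ; 1 2 ] */
    double a[4] = {3.0, 1.0, 1.0, 2.0};
    double b[2] = {9.0, 8.0};             /* right-hand side; overwritten with x */

    dgesv(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%f, %f)\n", b[0], b[1]);   /* expected x = (2, 3) */
    return 0;
}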

Discrete Fourier Transforms
- One-dimensional, two-dimensional, three-dimensional, …
- Multithreaded
- Mixed radix
- User-specified scaling and transform sign
- Transforms on embedded matrices
- Multiple one-dimensional transforms in a single call
- Strides
- C and Fortran 90 interfaces

Using the Intel® Math Kernel Library DFTs
Basically a three-step process:
1. Create a descriptor:  Status = DftiCreateDescriptor(MDH, …)
2. Commit the descriptor (instantiates it):  Status = DftiCommitDescriptor(MDH)
3. Perform the transform:  Status = DftiComputeForward(MDH, X)
Optionally, free the descriptor.
(MDH: MyDescriptorHandle)
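
Put together, a one-dimensional complex-to-complex transform looks roughly like this (a minimal sketch using the DFTI C interface from mkl_dfti.h; checking each status value is omitted for brevity):

/* Sketch: 1-D, double-precision, complex in-place forward DFT of 32 points. */
#include <stdio.h>
#include <mkl_dfti.h>

int main(void)
{
    MKL_LONG status;
    DFTI_DESCRIPTOR_HANDLE handle = NULL;
    double x[2 * 32] = {0};                       /* interleaved real/imaginary pairs */
    x[0] = 1.0;                                   /* an impulse */

    /* Step 1: create the descriptor */
    status = DftiCreateDescriptor(&handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)32);
    /* Step 2: commit (instantiate) the descriptor */
    status = DftiCommitDescriptor(handle);
    /* Step 3: compute the forward transform in place */
    status = DftiComputeForward(handle, x);
    /* Optional: free the descriptor */
    status = DftiFreeDescriptor(&handle);

    printf("X[0] = %f + %fi\n", x[0], x[1]);      /* DFT of an impulse: every bin is 1 + 0i */
    return (int)status;
}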

Vector Math Library (VML) Features/Issues
- Vector Math Library: vectorized transcendental functions – like libm, but faster
- Interfaces: both Fortran and C
- Multiple accuracies:
  - High accuracy (< 1 ulp)
  - Lower accuracy, faster (< 4 ulps)
- Special-value handling: √(-a), sin(0), and so on
- Error handling – cannot duplicate libm behavior exactly here
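
As an illustration, the vectorized double-precision exponential is a single call over a whole array. A minimal sketch using vdExp from mkl_vml.h; the vmlSetMode() call selecting the lower-accuracy mode is optional, and the exact mode constants are an assumption about the installed MKL version:

/* Sketch: y[i] = exp(x[i]) for a whole vector with one VML call. */
#include <stdio.h>
#include <mkl_vml.h>

int main(void)
{
    double x[4] = {0.0, 1.0, 2.0, 3.0};
    double y[4];

    vmlSetMode(VML_LA);      /* optional: lower-accuracy, faster mode (VML_HA = high accuracy) */
    vdExp(4, x, y);          /* vectorized exponential */

    for (int i = 0; i < 4; i++)
        printf("exp(%g) = %g\n", x[i], y[i]);
    return 0;
}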

VML: Why Does It Matter?
- It is important for financial codes (Monte Carlo simulations): exponentials, logarithms.
- Other scientific codes depend on transcendental functions.
- Error functions can be big time sinks in some codes.
- And so on.

Vector Statistical Library (VSL)
- Set of random number generators (RNGs)
- Numerous non-uniform distributions
- VML used extensively for transformations
- Parallel computation support – some functions
- User can supply own BRNG or transformations
- Five basic RNGs (BRNGs) – bits, integer, FP: MCG31, R250, MRG32, MCG59, WH

Non-Uniform RNGs
Gaussian (two methods), Exponential, Laplace, Weibull, Cauchy, Rayleigh, Lognormal, Gumbel

Using VSL
Basically a three-step process:
1. Create a stream pointer:  VSLStreamStatePtr stream;
2. Create a stream:  vslNewStream( &stream, VSL_BRNG_MCG31, seed );
3. Generate a set of random numbers:  vsRngUniform( 0, stream, size, out, start, end );
Delete the stream (optional):  vslDeleteStream( &stream );

Activity: Calculating Pi Using a Monte Carlo Method
Compare the performance of C source code (the rand() function) and VSL. Exercise control of the threading capabilities in MKL/VSL. A sketch of the VSL version follows.
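
A minimal sketch of the VSL side of this activity: it estimates pi by counting uniform random points that fall inside the unit quarter circle. The method constant VSL_RNG_METHOD_UNIFORM_STD is the newer spelling (older MKL releases use VSL_METHOD_DUNIFORM_STD), so treat the exact names as assumptions for your version:

/* Sketch: Monte Carlo estimate of pi with VSL uniform random numbers. */
#include <stdio.h>
#include <mkl_vsl.h>

#define NPOINTS 1000000

int main(void)
{
    static double x[NPOINTS], y[NPOINTS];
    VSLStreamStatePtr stream;
    int i, inside = 0;

    vslNewStream(&stream, VSL_BRNG_MCG31, 777);   /* BRNG + seed */
    /* Fill x and y with uniform random numbers in [0, 1). */
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, NPOINTS, x, 0.0, 1.0);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, NPOINTS, y, 0.0, 1.0);

    for (i = 0; i < NPOINTS; i++)
        if (x[i] * x[i] + y[i] * y[i] <= 1.0)
            inside++;

    printf("pi is approximately %f\n", 4.0 * (double)inside / NPOINTS);

    vslDeleteStream(&stream);
    return 0;
}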

Performance Libraries: Intel® MKL – What's Been Covered
- Intel® Math Kernel Library is a broad scientific/engineering math library.
- It is optimized for Intel® processors.
- It is threaded for effective use on SMP machines.

Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.