Published by Darren Mitchell. Modified over 9 years ago.
2
Upon completion of this module, you will be able to:
• Define the purpose of MKL
• Identify and discuss MKL contents
• Describe the MKL environment
• Discuss MKL and LAPACK
• Describe VML, its features and use
3
• Introduction
• The Library Sections
• Performance Features
• Using the Library
4
MKL addresses:
• Solvers (BLAS, LAPACK)
• Eigenvector/eigenvalue solvers (BLAS, LAPACK)
• Some quantum chemistry needs (dgemm)
• PDEs, signal processing, seismic, solid-state physics (FFTs)
• General scientific and financial codes (vector transcendental functions (VML) and vector random number generators (VSL))
5
Don’t use Intel® Math Kernel Library (Intel® MKL) on …
• Software construction
• Geometric transformation
• Don’t use Intel® MKL on “small” counts.
• Don’t call vector math functions on small n.
  ◦ But you could use Intel® Performance Primitives
6
BLAS (Basic Linear Algebra Subprograms)
• Level 1 BLAS – vector-vector operations
  ◦ 15 function types
  ◦ 48 functions
• Level 2 BLAS – matrix-vector operations
  ◦ 26 function types
  ◦ 66 functions
• Level 3 BLAS – matrix-matrix operations
  ◦ 9 function types
  ◦ 30 functions
• Extended BLAS – Level 1 BLAS for sparse vectors
  ◦ 8 function types
  ◦ 24 functions
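The three BLAS levels can be sketched as plain C loops. This is an illustrative reference only, not how MKL implements them: the `ref_` names are hypothetical, the matrices are assumed square and row-major, and the scaling factors of the full routines are omitted. The CBLAS counterparts are `cblas_daxpy`, `cblas_dgemv` and `cblas_dgemm`.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y = alpha*x + y.
   CBLAS counterpart: cblas_daxpy. */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y = A*x for a row-major n-by-n matrix A.
   Simplified case of what cblas_dgemv computes. */
void ref_dgemv(size_t n, const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Level 3 (matrix-matrix): C = A*B for row-major n-by-n matrices.
   Simplified case of what cblas_dgemm computes. */
void ref_dgemm(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The loop nesting depth (one, two, three) mirrors the level number, which is why the work grows from O(n) to O(n^3) while the data grows only to O(n^2).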
7
LAPACK (Linear Algebra PACKage)
• Solvers and eigensolvers; many hundreds of routines in total
• More than 1000 user-callable and support routines in total
Discrete Fourier Transforms (DFT)
• Mixed radix, multi-dimensional transforms
• Multithreaded
VML (Vector Math Library)
• Set of vectorized transcendental functions
• Most of the libm functions, but faster
VSL (Vector Statistical Library)
• Set of vectorized random number generators
8
• BLAS and LAPACK* are both Fortran
  ◦ Legacy of high-performance computation
• VSL and VML have Fortran and C interfaces
• DFTs have Fortran 95 and C interfaces
• cblas interface: a more convenient way for a C/C++ programmer to call BLAS
9
• Supports 32-bit and 64-bit Intel® processors
• Large set of examples and tests
• Extensive documentation
10
11/28/2015
The goal of all optimization is maximum speed.
Resource-limited optimization – exhaust one or more resources of the system:
• CPU: register use, FP units
• Cache: keep data in cache as long as possible; deal with cache interleaving
• TLBs: maximally use the data on each page
• Memory bandwidth: access memory as little as possible
• Computer: use all the available processors through threading
• System: use all the available nodes (cluster software)
11
Most of Intel MKL could be threaded, but:
• The limited resource is memory bandwidth
• Threading Level 1 and Level 2 BLAS is mostly ineffective (O(n))
There are numerous opportunities for threading:
• Level 3 BLAS (O(n^3))
• LAPACK* (O(n^3))
• FFTs (O(n log n))
• VML, VSL? Depends on processor and function
All threading is via OpenMP*
All of Intel MKL is designed and compiled for thread safety
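The memory-bandwidth argument can be made concrete by comparing floating-point operations with the number of values each kernel touches. The sketch below uses standard textbook operation counts for square n-by-n problems; the function names are hypothetical and nothing here is measured from MKL.

```c
#include <assert.h>

/* Level 1, y = alpha*x + y: about 2n flops over the 2n values of x and y. */
double flops_per_value_level1(double n)
{
    assert(n > 0.0);
    return (2.0 * n) / (2.0 * n);
}

/* Level 2, y = A*x: about 2n^2 flops over n^2 + 2n values. */
double flops_per_value_level2(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n) / (n * n + 2.0 * n);
}

/* Level 3, C = A*B: about 2n^3 flops over 3n^2 values. */
double flops_per_value_level3(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

For n = 300 the ratios are 1, just under 2, and 200. Level 1 and Level 2 do a constant amount of work per value fetched, so extra threads mostly wait on memory, while Level 3 reuses each value many times and can keep additional processors busy.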
12
Scenario 1: ifort, BLAS, IA-32 processor:
    ifort myprog.f mkl_c.lib
Scenario 2: CVF, LAPACK, IA-32 processor:
    f77 myprog.f mkl_s.lib
Scenario 3: Link a C program against the import library, with the DLL loaded at run time:
    link myprog.obj mkl_c_dll.lib
Note: The optimal binary code path is selected at run time based on the processor.
15
Most important LAPACK optimizations:
• Threading – effectively uses multiple CPUs
• Recursive factorization
  ◦ Reduces scalar time (Amdahl’s law: t = t_scalar + t_parallel/p)
  ◦ Extends blocking further into the code
• No runtime library support required
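Amdahl's law as written on the slide can be evaluated directly to see why reducing scalar time matters. The function names and the sample timings below are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Run time on p processors: t = t_scalar + t_parallel / p. */
double amdahl_time(double t_scalar, double t_parallel, int p)
{
    assert(p >= 1);
    return t_scalar + t_parallel / p;
}

/* Speedup on p processors relative to a single processor. */
double amdahl_speedup(double t_scalar, double t_parallel, int p)
{
    return amdahl_time(t_scalar, t_parallel, 1)
         / amdahl_time(t_scalar, t_parallel, p);
}
```

With t_scalar = 1 and t_parallel = 8, four processors give a speedup of 3.0 rather than 4. Halving the scalar part to 0.5 raises it to 3.4, which is the effect recursive factorization is after.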
16
• One-dimensional, two-dimensional, three-dimensional
• Multithreaded
• Mixed radix
• User-specified scaling, transform sign
• Transforms on embedded matrices
• Multiple one-dimensional transforms in a single call
• Strides
• C and F90 interfaces
17
Basically a three-step process:
• Create a descriptor
    Status = DftiCreateDescriptor(MDH, …)
• Commit the descriptor (instantiates it)
    Status = DftiCommitDescriptor(MDH)
• Perform the transform
    Status = DftiComputeForward(MDH, X)
• Optionally free the descriptor
18
Vector Math Library:
• Vectorized transcendental functions – like libm but better (faster)
• Interface: both Fortran and C interfaces
• Multiple accuracies
  ◦ High accuracy (< 1 ulp)
  ◦ Lower accuracy, faster (< 4 ulps)
• Special value handling: √(-a), sin(0), and so on
• Error handling – cannot duplicate libm here
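The accuracy modes are stated in ulps (units in the last place). As a rough illustration of the unit, the sketch below counts how many representable single-precision values separate two positive finite floats. It assumes IEEE 754 binary32 `float`, and `ulp_distance` is a hypothetical helper, not a VML function; VML's documented error bounds are measured against the exact result, so this integer count is only an approximation of that idea.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Number of representable floats between a and b.
   Valid for positive finite values: for those, the IEEE 754 bit
   patterns are ordered the same way as the values themselves. */
uint32_t ulp_distance(float a, float b)
{
    assert(a > 0.0f && b > 0.0f);
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);   /* reinterpret the bits without aliasing issues */
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```

A result within 1 ulp is at most one step away from the nearest representable value; the faster mode allows a few more steps in exchange for speed.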
19
• VML is important for financial codes (Monte Carlo simulations)
  ◦ Exponentials, logarithms
• Other scientific codes depend on transcendental functions
• Error functions can be big time sinks in some codes
20
Vector Statistical Library (VSL)
• Set of random number generators (RNGs)
• Numerous non-uniform distributions
• VML used extensively for transformations
• Parallel computation support – some functions
• User can supply own BRNG or transformations
• Five basic RNGs (BRNGs) – bits, integer, FP
  ◦ MCG31, R250, MRG32, MCG59, WH
21
Non-Uniform RNGs
• Gaussian (two methods)
• Exponential
• Laplace
• Weibull
• Cauchy
• Rayleigh
• Lognormal
• Gumbel
22
Using VSL
Basically a 3-step process:
• Create a stream pointer.
    VSLStreamStatePtr stream;
• Create a stream.
    vslNewStream(&stream, VSL_BRNG_MCG31, seed);
• Generate a set of random numbers.
    vsRngUniform(0, stream, size, out, start, end);
• Delete a stream (optional).
    vslDeleteStream(&stream);
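The create, generate, delete lifecycle can be imitated in portable C with a small multiplicative congruential generator. Everything named `toy_` is a hypothetical stand-in, not the VSL API. The multiplier 1132489760 and modulus 2^31 - 1 are assumed here to match the parameters documented for MCG31; treat them as an assumption and check the VSL notes before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VSL stream: it only holds generator state. */
typedef struct { uint64_t state; } toy_stream;

/* Create a stream from a seed (compare vslNewStream). Returns NULL on failure. */
toy_stream *toy_new_stream(uint32_t seed)
{
    toy_stream *s = malloc(sizeof *s);
    if (s != NULL) {
        s->state = seed % 2147483647u;
        if (s->state == 0)
            s->state = 1;            /* a zero state would stay zero forever */
    }
    return s;
}

/* Fill out[0..n-1] with uniform values between a and b (compare vsRngUniform). */
void toy_uniform(toy_stream *s, size_t n, float *out, float a, float b)
{
    assert(s != NULL && out != NULL && a < b);
    for (size_t i = 0; i < n; ++i) {
        /* x_next = (multiplier * x) mod (2^31 - 1), done in 64-bit arithmetic */
        s->state = (1132489760u * s->state) % 2147483647u;
        double u = (double)s->state / 2147483647.0;
        out[i] = (float)(a + (b - a) * u);
    }
}

/* Release the stream and clear the caller's pointer (compare vslDeleteStream). */
void toy_delete_stream(toy_stream **s)
{
    assert(s != NULL);
    free(*s);
    *s = NULL;
}
```

Two streams created from the same seed produce the same sequence, which is what makes seeded streams convenient for reproducible Monte Carlo runs.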
23
Activity: Calculating Pi using a Monte Carlo method
• Compare the performance of C source code (RAND function) and VSL.
• Exercise control of the threading capabilities in MKL/VSL.
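A minimal sketch of the plain C side of this comparison, assuming the standard library `rand()` is the "RAND function" the activity refers to. The function name `estimate_pi` and its parameters are hypothetical, and the VSL variant is not shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Estimate pi by sampling points in the unit square and counting
   how many fall inside the quarter circle of radius 1. */
double estimate_pi(unsigned int seed, long samples)
{
    assert(samples > 0);
    srand(seed);
    long inside = 0;
    for (long i = 0; i < samples; ++i) {
        double x = rand() / (RAND_MAX + 1.0);   /* uniform in [0, 1) */
        double y = rand() / (RAND_MAX + 1.0);
        if (x * x + y * y <= 1.0)
            ++inside;
    }
    /* area of quarter circle / area of square = pi / 4 */
    return 4.0 * (double)inside / (double)samples;
}
```

Each sample needs two random numbers generated one call at a time; a vectorized generator that fills a whole array per call is where the VSL version would be expected to gain.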
24
Performance Libraries: What’s Been Covered
• Intel® Math Kernel Library is a broad scientific/engineering math library.
• It is optimized for Intel® processors.
• It is threaded for effective use on SMP machines.