Intel Math Kernel Library (MKL) Clay P. Breshears, PhD Intel Software College NCSA Multi-core Workshop July 24, 2007.


Copyright © 2007, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance Libraries: Intel® Math Kernel Library (MKL)

Agenda
- Performance features
- The library sections: BLAS, LAPACK*, DFTs, VML, VSL
- SciMark 2.0 optimization case study (from Henry Gabb):
  - SciMark 2.0 overview
  - Tuning with the Intel compiler
  - Tuning with the Intel Math Kernel Library

Intel® Math Kernel Library: Purpose
- Performance, performance, performance!
- Intel's engineering, scientific, and financial math library
- Addresses:
  - Solvers (BLAS, LAPACK)
  - Eigenvector/eigenvalue solvers (BLAS, LAPACK)
  - Some quantum chemistry needs (dgemm)
  - PDEs, signal processing, seismic, solid-state physics (FFTs)
  - General scientific and financial uses: vector transcendental functions (VML) and vector random number generators (VSL)
- Tuned for Intel® processors, current and future

Intel® Math Kernel Library: Purpose - Don'ts
But don't use Intel® Math Kernel Library (Intel® MKL) on everything:
- Don't use Intel® MKL on "small" counts
- Don't call vector math functions on small n
[Figure: geometric transformation example, a 4x4 transformation matrix applied to (X, Y, Z, W) vectors]
For small problems like this, you could use Intel® Performance Primitives instead.

Intel® Math Kernel Library: Environment Support
- 32-bit and 64-bit Intel® processors
- Large set of examples and tests
- Extensive documentation

              Windows*            Linux*
  Compilers   Intel, Microsoft    Intel, GNU
  Libraries   .dll, .lib          .a, .so

Resource-Limited Optimization
The goal of all optimization is maximum speed. Resource-limited optimization exhausts one or more resources of the system:
- CPU: register use, FP units
- Cache: keep data in cache as long as possible; deal with cache interleaving
- TLBs: maximally use data on each page
- Memory bandwidth: minimally access memory
- Computer: use all the processors/cores available via threading
- System: use all the nodes available (cluster software)

Threading
Most of the Intel® Math Kernel Library could be threaded, but the limited resource is memory bandwidth: threading level 1 and level 2 BLAS is mostly ineffective (their O(n) and O(n^2) operations are memory-bound). There are numerous opportunities for threading:
- Level 3 BLAS (O(n^3))
- LAPACK* (O(n^3))
- FFTs (O(n log n))
- VML, VSL: depends on processor and function
All threading is via OpenMP*. All of Intel MKL is designed and compiled for thread safety.

SciMark 2.0
- Produced by the National Institute of Standards and Technology (NIST)
- ANSI C and Java versions available
- Five floating-point-intensive kernels:
  - FFT: compute a complex 1D FFT
  - SOR: Jacobi successive over-relaxation in 2D
  - MC: compute pi by Monte Carlo integration
  - MV: sparse matrix-vector multiplication
  - LU: dense matrix LU factorization

SciMark 2.0 Problem Sizes

  Benchmark   Small                  Large
  FFT         N = 1024               N = 1048576
  SOR         100 x 100              1000 x 1000
  MC          (problem size not fixed; no small/large distinction)
  MV          N = 1000, NZ = 5000    N = 100000, NZ = 1000000
  LU          100 x 100              1000 x 1000

Benchmark System

Hardware
  CPU (dual-processor system)     3.6 GHz Xeon (2 MB L2 cache) with EM64T
  Motherboard                     Intel Server Board SE7520AF2
  Memory                          512 MB DDR2
  BIOS version                    P06
  Adjacent cache line prefetch    ON
  Hardware prefetch               ON
  Hyper-Threading Technology      OFF

Software
  Operating system                Red Hat Enterprise Linux AS3 (Linux kernel EL #1 SMP)
  Intel C++ Compiler for Linux    8.1 (l_cce_pc_)
  Intel Cluster MKL               7.2 (l_cluster_mkl_)
  GNU C Compiler                  gcc 3.2.3

GNU Performance Baseline
Aggressive optimization significantly improves performance relative to the default optimization level. The following gcc options were used to establish baseline performance:
  -O3 -march=nocona -ffast-math -mfpmath=sse

Intel C++ Compiler for Linux
Performance:
- Automatic vectorization
- Streaming SIMD Extensions 3
- IPO and PGO
- Automatic parallelization and OpenMP* support
- Automatic CPU dispatch
- Much more...
Compatibility:
- Source- and object-compatible with gcc and g++
- Supports GNU inline ASM
- ANSI/ISO C/C++ standards compliance
- Conforms to the C++ ABI standard
- Integrated with the Eclipse IDE

Tuning SciMark 2.0 with the Intel Compiler
The Intel C++ Compiler for Linux improves SciMark 2.0 performance relative to the GNU baseline. Intel compiler options:
  -O3 -xP -ipo -fno-alias

Intel® Math Kernel Library Contents: BLAS
BLAS (Basic Linear Algebra Subroutines):
- Level 1 BLAS: vector-vector operations (15 function types, 48 functions)
- Level 2 BLAS: matrix-vector operations (26 function types, 66 functions)
- Level 3 BLAS: matrix-matrix operations (9 function types, 30 functions)
- Extended BLAS: level 1 BLAS for sparse vectors (8 function types, 24 functions)

Intel® Math Kernel Library Contents: LAPACK and More
- LAPACK (Linear Algebra PACKage): solvers and eigensolvers; more than 1000 user-callable and support routines in total
- DFTs (discrete Fourier transforms): mixed-radix, multi-dimensional transforms; multithreaded
- VML (Vector Math Library): a set of vectorized transcendental functions covering most libm functions, but faster
- VSL (Vector Statistical Library): a set of vectorized random number generators

Intel® Math Kernel Library Contents: Interfaces
- BLAS and LAPACK* are both Fortran interfaces (a legacy of high-performance computation)
- VSL and VML have Fortran and C interfaces
- DFTs have Fortran 95 and C interfaces
- A cblas interface is available, which is more convenient for a C/C++ programmer

Intel® Math Kernel Library: Optimizations in LAPACK*
Most important LAPACK optimizations:
- Threading: effectively uses multiple cores
- Recursive factorization:
  - Reduces scalar time (Amdahl's law: t = t_scalar + t_parallel / p)
  - Extends blocking further into the code
  - No runtime library support required

Tuning the SciMark 2.0 LU Kernel
Replacing the SciMark 2.0 LU kernel with the LAPACK dgetrf function requires attention to detail:
- SciMark 2.0 is written in C; LAPACK defines a Fortran interface
- C is call-by-value; Fortran is call-by-reference
- C uses row-major ordering; Fortran uses column-major ordering
- For best performance, dgetrf requires data to be contiguous in memory, but the SciMark 2.0 LU kernel allocates its 2D array as pointers-to-pointers (not necessarily contiguous in memory)

Tuning the SciMark 2.0 LU Kernel
The Intel MKL LAPACK significantly improves performance over the original SciMark 2.0 LU source code.

Intel® Math Kernel Library Contents: Discrete Fourier Transforms
- One-dimensional, two-dimensional, three-dimensional, and higher
- Multithreaded
- Mixed radix
- User-specified scaling and transform sign
- Transforms on embedded matrices
- Multiple one-dimensional transforms in a single call
- Strides
- C and Fortran 90 interfaces; FFTW interface support

Using the Intel® Math Kernel Library DFTs
Basically a three-step process (MDH: MyDescriptorHandle):
1. Create a descriptor: Status = DftiCreateDescriptor(MDH, ...)
2. Commit the descriptor (instantiates it): Status = DftiCommitDescriptor(MDH)
3. Perform the transform: Status = DftiComputeForward(MDH, X)
Optionally, free the descriptor.

Tuning the SciMark 2.0 FFT Kernel

#include <mkl_dfti.h>

int N = 1024;                        // Size of SciMark 2.0 small FFT problem
double scale = 1.0 / (double)N;
double *x = RandomVector(2 * N, R);  // SciMark creates a random vector of size
                                     // 2*N to hold real and imaginary parts

DFTI_DESCRIPTOR *dftiHandle;         // Structure for MKL DFT descriptor

DftiCreateDescriptor(&dftiHandle,    // Transform descriptor
                     DFTI_DOUBLE,    // Precision
                     DFTI_COMPLEX,   // Complex-to-complex
                     1,              // Number of dimensions
                     N);             // Size of transform

// Apply scaling factor to backward transform
DftiSetValue(dftiHandle, DFTI_BACKWARD_SCALE, scale);

DftiCommitDescriptor(dftiHandle);
DftiComputeForward(dftiHandle, x);   // Apply DFT to array x
DftiComputeBackward(dftiHandle, x);  // Apply inverse DFT to array x
DftiFreeDescriptor(&dftiHandle);

Tuning the SciMark 2.0 FFT Kernel
The Intel MKL DFT significantly improves performance over the original SciMark 2.0 FFT source code.

Intel® Math Kernel Library Contents: Vector Math Library (VML)
- Vectorized transcendental functions: like libm, but faster
- Both Fortran and C interfaces
- Multiple accuracies:
  - High accuracy (< 1 ulp)
  - Lower accuracy, faster (< 4 ulps)
- Special value handling: √(-a), sin(0), and so on
- Error handling: cannot duplicate libm here

VML: Why Does It Matter?
- It is important for financial codes (Monte Carlo simulations): exponentials, logarithms
- Other scientific codes depend on transcendental functions
- Error functions can be big time sinks in some codes

Intel® Math Kernel Library Contents: Vector Statistical Library (VSL)
- Set of random number generators (RNGs)
- Numerous non-uniform distributions
- VML used extensively for transformations
- Parallel computation support for some functions
- Users can supply their own BRNGs or transformations
- Five basic RNGs (BRNGs): MCG31, R250, MRG32, MCG59, WH

Non-Uniform RNGs
Gaussian (two methods), Exponential, Laplace, Weibull, Cauchy, Rayleigh, Lognormal, Gumbel

Using VSL
Basically a three-step process:
1. Create a stream:
     VSLStreamStatePtr stream;
     vslNewStream(&stream, VSL_BRNG_MCG31, seed);
2. Generate a set of RNGs:
     vsRngUniform(0, stream, size, out, start, end);
3. Delete the stream (optional):
     vslDeleteStream(&stream);

Calculating Pi by Monte Carlo

  loop i = 1 to N_samples
      x = random [0..1]
      y = random [0..1]
      dist = sqrt(x^2 + y^2)
      if dist <= 1
          hits = hits + 1
  pi = 4 * hits / N_samples

Tuning the SciMark 2.0 MC Kernel

#include <math.h>
#include <mkl_vsl.h>

double MonteCarlo_integrate(int Num_samples)
{
    int i, j, blocks, under_curve = 0;
    static double rnBuf[2 * BLOCK_SIZE];
    double rnX, rnY;
    VSLStreamStatePtr stream;

    blocks = Num_samples / BLOCK_SIZE;
    vslNewStream(&stream, VSL_BRNG_MCG31, SEED);
    for (i = 0; i < blocks; i++) {
        vdRngUniform(VSL_METHOD_DUNIFORM_STD, stream,
                     (2 * BLOCK_SIZE), rnBuf, 0.0, 1.0);
        for (j = 0; j < BLOCK_SIZE; j++) {
            rnX = rnBuf[2*j];
            rnY = rnBuf[2*j+1];
            if (sqrt(rnX*rnX + rnY*rnY) <= 1.0)
                under_curve++;
        }
    }
    vslDeleteStream(&stream);
    return ((double)under_curve / Num_samples) * 4.0;
}

Tuning the SciMark 2.0 MC Kernel
The Intel MKL VSL significantly improves performance over the original SciMark 2.0 MC source code.

Best SciMark 2.0 Single-Node Performance: Small Problems
[Table: small-problem performance in MFLOPS (GNU vs. Intel, with speedup) for FFT, SOR, MC, MV, LU, and the composite score]

Best SciMark 2.0 Single-Node Performance: Large Problems
[Table: large-problem performance in MFLOPS (GNU vs. Intel, with speedup) for FFT, SOR, MC, MV, LU, and the composite score]

Intel® Cluster MKL
Intel Cluster MKL is a superset of MKL for solving large linear algebra problems on a cluster. It contains:
- ScaLAPACK (Scalable LAPACK)
- BLACS (Basic Linear Algebra Communication Subprograms)
- Support for MPICH and the Intel MPI Library

Data Layout Is Critical to Parallel Performance
ScaLAPACK uses a 2D block-cyclic data distribution.
[Figure: example layouts of a lower triangular matrix distributed over four processes]

Parallelizing the SciMark 2.0 LU Kernel with Intel® Cluster MKL
1. Initialize the process grid
2. Create a descriptor for each distributed matrix
3. Replace the call to dgetrf with pdgetrf (the 'p' is for parallel)
Result: LU factorization of a [size] x [size] matrix on an 8-node, dual 3.0 GHz Xeon cluster achieves [n] MFLOPS.

Performance Libraries: Intel® MKL - What's Been Covered
- Intel® Math Kernel Library is a broad scientific/engineering math library
- It is optimized for Intel® processors
- It is threaded for effective use on multi-core and SMP machines
- The Intel C++ Compiler for Linux improves SciMark 2.0 performance without requiring code modifications
- With minor code modifications, Intel MKL dramatically improves the FFT, MC, and LU kernels
- Some SciMark 2.0 kernels benefit from parallel computing

Useful Links
- Intel Software Products
- Intel Software Network
- Intel Software College
- SciMark 2.0
