Optimizing Batched Linear Algebra on Intel® Xeon Phi™ Processors
Sarah Knepper, Murat E. Guney, Kazushige Goto, Shane Story, Arthur Araujo Mitrano, Tim Costa, and Louise Huot Intel® Math Kernel Library (Intel® MKL)
Agenda
- Overview of Intel® Xeon and Xeon Phi™ processors
  - Focus on the Xeon Phi™ x200 processor
- Overview of batched linear algebra
  - Focus on the Intel MKL batched API
- Performance on the Xeon Phi™ x200 processor
How Intel MKL gets performance
More cores, more threads, wider vectors with each generation:

Intel® Xeon® processors:

| Generation | Up to cores | Up to threads | SIMD width | Vector ISA |
|---|---|---|---|---|
| Intel® Xeon® processor 64-bit | 1 | 2 | 128 | Intel® SSE3 |
| Intel® Xeon® processor 5100 series | 2 | 2 | 128 | Intel® SSE3 |
| Intel® Xeon® processor 5500 series | 4 | 8 | 128 | Intel® SSE4-4.1 |
| Intel® Xeon® processor 5600 series | 6 | 12 | 128 | Intel® SSE4.2 |
| Intel® Xeon® processor E v2 series | 12 | 24 | 256 | Intel® AVX |
| Intel® Xeon® processor E v3/v4 series | 18-22 | 36-44 | 256 | Intel® AVX2 |
| Future Intel® Xeon® processor (1) | TBD | TBD | 512 | Intel® AVX-512 |

Intel® Xeon Phi™ products:

| Generation | Up to cores | Up to threads | SIMD width | Vector ISA |
|---|---|---|---|---|
| Intel® Xeon Phi™ x100 coprocessor (KNC) | 61 | 244 | 512 | Intel® IMCI |
| Intel® Xeon Phi™ x200 processor & coprocessor (KNL) | 72 | 288 | 512 | Intel® AVX-512 |
| Future Intel® Xeon Phi™ (KNH) (1) | TBD | TBD | TBD | TBD |

The first generation of Intel Xeon Phi coprocessors (codename "Knights Corner", abbreviated KNC) supports a 512-bit SIMD instruction set called Intel® Initial Many Core Instructions (Intel® IMCI). Product specifications for launched and shipped products are available on ark.intel.com. All dates and products specified are for planning purposes only and are subject to change without notice.
1. Not launched or in planning.
Intel® Xeon Phi™ Product Family x200
Intel® Xeon Phi™ processor:
- Self-boot Intel® Xeon Phi™ processor
- Host processor in the Groveport platform
- Available with integrated Intel® Omni-Path Fabric

Intel® Xeon Phi™ coprocessor x200:
- Requires an Intel® Xeon® processor host
- Ingredient of Grantley platforms
Current Intel® Xeon and Xeon Phi™ Platforms
Broadwell (14 nm process) - Intel's foundation of HPC performance:
- Up to 22 cores, with Hyper-Threading
- ~66 GB/s STREAM memory bandwidth (4 channels of DDR4-2400)
- Intel AVX2: 256-bit registers (4 DP or 8 SP elements per register)

Xeon Phi Knights Landing (14 nm process) - optimized for highly parallel, compute-intensive workloads:
- Common programming model and software tools with Xeon processors, enabling efficient application readiness and performance tuning
- Up to 72 cores, ~490 GB/s STREAM bandwidth, on-die 2D mesh
- Intel AVX-512: 512-bit registers (8 DP or 16 SP elements per register)
- Each KNL core has 2 VPUs (AVX-512): 2x8 FMAs per cycle for DP, 2x16 for SP
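As a rough sanity check of what those numbers imply (my arithmetic, not a figure from the deck): a 68-core Xeon Phi 7250 at 1.4 GHz, with 2 VPUs per core each retiring an 8-wide DP FMA (2 flops per lane) every cycle, peaks at about 68 x 2 x 8 x 2 x 1.4 ≈ 3,046 DP GFLOP/s, i.e. roughly 3 TFLOP/s in double precision. Dense batched GEMM is one of the few small-matrix workloads that can approach such a peak.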
Intel® Xeon Phi™ Processor (Knights Landing)
A highly parallel CPU that transcends GPU accelerators:
- Up to 36 enabled tiles, with significant improvement in scalar and vector performance
- Topple the memory wall: 16 GB of integrated on-package memory
- Raise the memory ceiling: platform memory up to 384 GB (DDR4)
- No PCIe bottleneck: bootable host processor that boots a standard OS
- Scale out seamlessly: efficient scaling like Intel® Xeon® processors
- Reduce cost: dual-port Intel® Omni-Path Fabric (1)
- Run x86 workloads: binary-compatible with mainline IA

[Diagram: processor package with bootable host CPU, integrated fabric, integrated on-package memory, and platform DDR4 memory; the tile detail shows two cores, each with 2 VPUs, connected through a hub to a shared 1 MB L2.]

Details:
- Memory on package: innovative memory architecture for high bandwidth and high capacity
- Core: changed from Knights Corner (KNC) to KNL; based on the 2-wide out-of-order Silvermont™ microarchitecture, but with many changes for HPC: 4 threads/core, deeper out-of-order execution, better RAS, higher bandwidth, larger TLBs
- L2: 1 MB, 16-way; 1 line read and 1/2 line write per cycle; coherent across all tiles
- CHA: caching/home agent; a distributed tag directory keeps the L2s coherent; MESIF protocol; 2D-mesh connections between tiles

1. Reduced cost based on an Intel internal estimate comparing the cost of discrete networking components with the integrated fabric solution.
Introduction to batched linear algebra
Execute independent general matrix multiplication (GEMM) operations simultaneously with a single function call; there are no data dependencies between the operations.

Use cases: numerical integration, sparse solvers, rotation matrices, etc.

Example sequence:
- C1 = alpha * op(A1) * op(B1) + beta * C1
- C2 = alpha * op(A1) * op(B1) + beta * C2   <- can execute in parallel with the above if there is no pointer aliasing
- C3 = alpha * op(A2) * op(B1) + beta * C3
- C2 = alpha * op(A2) * op(B2) + beta * C2   <- must wait for the previous write to C2

HPC applications often operate on large numbers of very small matrices (3x3, 5x5, 6x6, 9x9, 15x15), e.g. FEM models, preconditioner application, computational lithography, collaborative filtering.
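To make the motivation concrete, here is a minimal sketch (not from the original deck) of the unbatched baseline: one standard CBLAS call per small matrix, paying per-call overhead every time. Function and variable names are illustrative.

```c
#include <mkl.h>

/* Unbatched baseline: one cblas_sgemm call per small matrix.
   Each call pays error checking, dispatching, and threading
   overheads that dominate at sizes like 3x3 or 9x9. */
void small_gemms_loop(int count, int n, const float **A,
                      const float **B, float **C)
{
    for (int i = 0; i < count; ++i) {
        /* C[i] = 1.0 * A[i] * B[i] + 0.0 * C[i], all matrices n x n */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A[i], n, B[i], n, 0.0f, C[i], n);
    }
}
```

Because the operations are independent, the loop iterations could run in parallel; the batched API described next lets the library exploit that itself.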
Batched linear algebra in Intel MKL
Batched functionality in Intel MKL:
- Batched GEMM/GEMM3M functionality since Intel MKL 11.3
- Batched TRSM functionality starting with Intel MKL 2018 Beta

Benefits:
- Better utilization of multi-/many-core processors for small/medium sizes
- Minimized library overheads (error checking, dispatching, function-call overhead) for small sizes
- Allows batching calls with different parameters
- Exposes performance opportunities: prefetching, cross-matrix vectorization, splitting ALL batched matrices among threads

Intel MKL is free with the community license.
Intel MKL Batch API
The API allows batching calls with different parameters:
- Group: a number of GEMM operations with the same parameters
- Batch: a number of GEMM groups
- GEMM_BATCH executes multiple groups simultaneously

Two additional parameters compared to the traditional GEMM functions:
- group_count (integer): total number of groups
- group_size (integer array): the number of GEMMs within each group

A consistent level of indirection for the GEMM parameters:
- An integer becomes an array of integers
- A matrix pointer becomes an array of matrix pointers
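A minimal sketch of a cblas_sgemm_batch call with two groups, assuming column-major layout; the sizes (three 5x5 products and two 9x9 transposed-A products) and all variable names are illustrative, not from the deck.

```c
#include <mkl.h>

/* Two groups: group 0 holds three 5x5 products, group 1 holds two
   9x9 transposed-A products.  Per-group arguments are arrays of
   length group_count; the matrix-pointer arrays have length
   group_size[0] + group_size[1] = 5. */
void batch_two_groups(const float *a[5], const float *b[5], float *c[5])
{
    CBLAS_TRANSPOSE transa[2] = { CblasNoTrans, CblasTrans };
    CBLAS_TRANSPOSE transb[2] = { CblasNoTrans, CblasNoTrans };
    MKL_INT m[2]   = { 5, 9 }, n[2]   = { 5, 9 }, k[2]   = { 5, 9 };
    MKL_INT lda[2] = { 5, 9 }, ldb[2] = { 5, 9 }, ldc[2] = { 5, 9 };
    float alpha[2] = { 1.0f, 1.0f }, beta[2] = { 0.0f, 0.0f };
    MKL_INT group_count = 2;
    MKL_INT group_size[2] = { 3, 2 };

    cblas_sgemm_batch(CblasColMajor, transa, transb, m, n, k,
                      alpha, a, lda, b, ldb, beta, c, ldc,
                      group_count, group_size);
}
```

On return, each c[i] holds the product for its matrix; the library decides internally how to spread the five multiplications across threads.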
GEMM_BATCH in Intel MKL - Group Concept
Group: a set of GEMM operations with the same input parameters (except for the matrix pointers):
- Transpose, size, leading dimension, alpha, beta
One or more groups per GEMM_BATCH call.
Interface of various batched GEMMs versus GEMM
| Argument | Description | BLAS sgemm | MAGMA* magma_sgemm_batched | MAGMA* magma_sgemm_vbatched | NVIDIA* cublasSgemmBatched | Intel MKL sgemm_batch |
|---|---|---|---|---|---|---|
| HANDLE | handle to the cuBLAS library context | -- | -- | -- | cublasHandle_t | -- |
| TRANSA | op(A) | char | char | char | char | char * |
| TRANSB | op(B) | char | char | char | char | char * |
| M | rows of op(A)/C | int | int | int * | int | int * |
| N | columns of op(B)/C | int | int | int * | int | int * |
| K | columns of op(A)/rows of op(B) | int | int | int * | int | int * |
| ALPHA | alpha | float | float | float | float | float * |
| A | input matrix | float * | float ** | float ** | float ** | float ** |
| LDA | leading dimension of A | int | int | int * | int | int * |
| B | input matrix | float * | float ** | float ** | float ** | float ** |
| LDB | leading dimension of B | int | int | int * | int | int * |
| BETA | beta | float | float | float | float | float * |
| C | input/output matrix | float * | float ** | float ** | float ** | float ** |
| LDC | leading dimension of C | int | int | int * | int | int * |
| BATCHCOUNT | number of matrices | -- | int | int | int | -- |
| QUEUE | queue to execute in | -- | magma_queue_t | magma_queue_t | -- | -- |
| GROUP_COUNT | number of groups | -- | -- | -- | -- | int |
| GROUP_SIZES | number of matrices in each group | -- | -- | -- | -- | int * |

For simplicity, some enum types are reduced to char or int. NVIDIA also provides cublasSgemmStridedBatched, which addresses the matrices at a fixed stride from a base pointer. Table idea and some data from Performance, Design, and Autotuning of Batched GEMM for GPUs by Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra.
Partitioning algorithms
How should the matrices be partitioned among threads?
- Depends on the number of matrices and the relative group sizes
- We look at 2 different algorithms

[Diagram: five matrices M1-M5 to be distributed among three cores C0-C2.]
Partitioning algorithm 1
- Partition the matrices across all GEMM groups
- Keep partitioning until there are enough partitions or the partitions become too small
- Distribute the matrices in balanced, contiguous partitions (see the sketch below)

[Diagram: matrices M1-M5 split into contiguous chunks, assigned in order to cores C0-C2.]
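A sketch of how such a balanced, contiguous split might be computed; this illustrates the idea only and is not Intel MKL's actual implementation.

```c
/* Algorithm 1 (illustrative): split the full list of `total` matrices
   into `nparts` balanced, contiguous chunks; partition p receives the
   half-open index range [begin, end). */
void contiguous_partition(int total, int nparts, int p,
                          int *begin, int *end)
{
    int base = total / nparts;   /* minimum matrices per partition   */
    int rem  = total % nparts;   /* first `rem` partitions get one extra */
    *begin = p * base + (p < rem ? p : rem);
    *end   = *begin + base + (p < rem ? 1 : 0);
}
```

For the pictured case (5 matrices, 3 cores), this yields {M1,M2} -> C0, {M3,M4} -> C1, {M5} -> C2.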
Partitioning algorithm 2
- Partition the matrices inside each GEMM group
- Biased toward no partitioning, to preserve kernel performance
- Keep partitioning until there are enough partitions or the partitions become too small
- Distribute the matrix partitions to the threads in a round-robin fashion (see the sketch below)

[Diagram: each matrix M1-M5 split into partitions, assigned round-robin to cores C0-C2.]
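A sketch of the round-robin distribution step, again illustrative rather than Intel MKL's actual code.

```c
/* Algorithm 2 (illustrative): once a group's matrices have been split
   into `nparts` partitions, hand partitions to threads round-robin so
   that work from a large group spreads evenly across all threads. */
void round_robin_assign(int nparts, int nthreads, int *owner)
{
    for (int part = 0; part < nparts; ++part)
        owner[part] = part % nthreads;   /* partition -> thread id */
}
```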
GEMM_BATCH on Intel® Xeon Phi™ Processor 7250
Configuration info - Versions: Intel® Math Kernel Library (Intel® MKL) 2018 Beta, Intel® MKL 2017 Update 2; Hardware: Intel® Xeon Phi™ Processor 7250, 68 cores (34 MB total cache, 1.4 GHz), 16 GB MCDRAM memory, 96 GB of DDR4 memory; Operating system: RHEL 7.2 GA x86_64. Benchmark source: Intel Corporation. (See the Legal Disclaimer & Optimization Notice slide for the full performance and optimization notices.)

Intel MKL 2018 Beta:
- Improved ?GEMM_BATCH performance on all architectures
- Greatly improved performance for N==1 ?GEMM_BATCH
TRSM_BATCH on Intel® Xeon Phi™ Processor 7250
Configuration info - Versions: Intel® Math Kernel Library (Intel® MKL) 2018 Beta; Hardware: Intel® Xeon Phi™ Processor 7250, 68 cores (34 MB total cache, 1.4 GHz), 16 GB MCDRAM memory, 96 GB of DDR4 memory; Operating system: RHEL 7.2 GA x86_64. Benchmark source: Intel Corporation. (See the Legal Disclaimer & Optimization Notice slide for the full performance and optimization notices.)

Intel MKL 2018 Beta includes ?TRSM_BATCH.
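The TRSM batch interface mirrors GEMM_BATCH. A minimal sketch assuming the CBLAS-style entry point cblas_strsm_batch with a single group of identical lower-triangular solves; the sizes and names are illustrative, and the exact argument list should be checked against the Intel MKL 2018 documentation.

```c
#include <mkl.h>

/* One group of `cnt` identical solves: B[i] <- alpha * inv(A[i]) * B[i].
   As with sgemm_batch, per-group scalars become arrays of length
   group_count and matrix pointers become arrays of pointers. */
void trsm_batch_one_group(MKL_INT cnt, const float **a, float **b)
{
    CBLAS_SIDE side[1]        = { CblasLeft };
    CBLAS_UPLO uplo[1]        = { CblasLower };
    CBLAS_TRANSPOSE transa[1] = { CblasNoTrans };
    CBLAS_DIAG diag[1]        = { CblasNonUnit };
    MKL_INT m[1] = { 6 }, n[1] = { 6 };        /* 6x6 triangles, 6x6 RHS */
    MKL_INT lda[1] = { 6 }, ldb[1] = { 6 };
    float alpha[1] = { 1.0f };
    MKL_INT group_size[1] = { cnt };

    cblas_strsm_batch(CblasColMajor, side, uplo, transa, diag,
                      m, n, alpha, a, lda, b, ldb,
                      1 /* group_count */, group_size);
}
```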
Performance tips:
- Store matrices contiguously, and in MCDRAM (allocate with hbw_malloc, or run under numactl -m 1)
- Choose appropriate leading dimensions: for dimensions that are multiples of 128, use a leading dimension offset by one cache line. For example, choose the leading dimension with the following formula (element_size = number of bytes per matrix element, e.g. 8 for double); see the sketch after this list:
  (((n * element_size + 511) / 512) * 512 + 64) / element_size
- Use 1 hardware thread per core: KMP_AFFINITY=compact,1,0,granularity=fine
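A sketch combining the two storage tips, assuming the memkind library's hbw_malloc for MCDRAM allocation; the helper names are hypothetical.

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth (MCDRAM) allocator */

/* Apply the leading-dimension tip: round the column height up to a
   multiple of 512 bytes, then add one 64-byte cache line so columns of
   power-of-two-sized matrices do not collide in cache.
   element_size is 8 for double, 4 for float. */
static int padded_ld(int n, int element_size)
{
    return (((n * element_size + 511) / 512) * 512 + 64) / element_size;
}

/* Allocate one n x n double matrix contiguously in MCDRAM.
   (Alternatively, run the whole program under `numactl -m 1`.) */
double *alloc_matrix_mcdram(int n, int *ld)
{
    *ld = padded_ld(n, (int)sizeof(double));
    return (double *)hbw_malloc((size_t)(*ld) * (size_t)n * sizeof(double));
}
```

For n = 128 doubles, padded_ld returns 136: the 1024-byte column is padded to 1088 bytes, exactly one cache line past a multiple of 512.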
Final remarks
- Intel® Xeon Phi™ processors are extremely parallel and use a general-purpose programming model
- Batching better utilizes multi- and many-core processors for small/medium matrices
- Groups contain matrices with the same parameters (size, leading dimension, etc.)
- The Intel MKL batched API combines ease of use with performance opportunities
Intel MKL resources
- Website: https://software.intel.com/en-us/intel-mkl
- License options
- Forum
- Link line advisor
Thank you!

Poster session PP3, Tuesday 4:30-6:30 PM:
- Accelerating Multiplication of Small or Skinny Matrices with Intel® Math Kernel Library Packed GEMM Routines
- API for the Compact Batched BLAS, Intel MKL Team
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #