1 Intel Xeon Phi Programming Basics
Zhou Chunbao (周纯葆)
Supercomputing Center, Computer Network Information Center, Chinese Academy of Sciences
zhoucb@sccas.cn

2 Outline
Introduction
Native: MIC stand-alone
Offload: MIC as coprocessor
Symmetric: MPI

3 What is MIC (Xeon Phi)?
First product of Intel's Many Integrated Core (MIC) architecture
A coprocessor: a PCI Express card running a stripped-down Linux operating system
A dense, simplified processor: wider vector unit (512-bit), higher hardware thread count, many power-hungry features removed (branch prediction, ...)
Lots of names: Many Integrated Core architecture (MIC), Knights Corner (codename), Intel Xeon Phi Coprocessor SE10P (product name)

4 Intel® Xeon Phi™ vs Intel® Xeon®
The Xeon chip is designed to work well for all workloads; the MIC is designed to work well for highly parallel workloads
A single MIC core is slower than a Xeon core, but the number of cores on the card is much higher

5 Why use MIC? Intel’s MIC is based on x86 technology
x86 cores with caches and cache coherency; SIMD instruction set
Programming for MIC is similar to programming for CPUs
Familiar languages: C/C++ and Fortran
Familiar parallel programming models: OpenMP & MPI (MPI on the host and on the coprocessor)
Any code can run on MIC, not just kernels
Optimizing for MIC is similar to optimizing for CPUs: "optimize once, run anywhere"

6 How to use MIC? Several distinct execution models
How you run your application depends on how it uses the MIC (the "host" CPU versus the MIC "device")

7 Host-Only: Ignore the MIC
Traditional CPU mode: no source-code changes
Compile your C/C++/Fortran code with the Intel compilers
Pure MPI or MPI + "X", where X is OpenMP, TBB, Cilk+, etc.
Application runs on one or more hosts (only)

8 Native: One MIC running autonomously
To compile: add one new flag, "-mmic"
To run: ssh to the MIC manually
Multiple programming models available: plausible choices are OpenMP and MPI+X; implausible choices are serial and pure MPI
Good performance requires a high degree of parallelism
Application runs on a single MIC (only)

9 Offload: MIC as assistant processor
A program running on the host "offloads" work by directing the MIC to execute a specified block of code
The host directs the exchange of data between host and MIC
Resembles GPU programming, but in ordinary C/C++/Fortran with OpenMP-style directives; compile and run on the host
Ideally, the host stays active while the MIC coprocessor does its assigned work ("...do work and deliver results as directed...")
Application runs on the host

10 Automatic Offload (AO)
Pre-packaged offload in the Intel Math Kernel Library (MKL)
Accelerates frequently used math processing: computationally intensive linear algebra, highly vectorized and threaded matrix computation, FFT, etc.
MKL automatically manages the details
More than offload: work division/balance across host and MIC!
Supported routines: xGEMM, xGESV, etc.
Callable from C/C++, Fortran, R, Python, MATLAB, etc.

11 Symmetric: MPI Tasks Across Multiple Devices
MPI tasks on MICs and hosts (or on multiple MICs only)
To compile: build two versions of the executable, one for each side
To run: use a special MPI launcher that manages the details
Challenges include:
Heterogeneity and capacity (RAM, performance)
Communication (MICs connected only through their hosts)
File operations (MIC-based I/O is slow at best, problematic at worst)

12 Can I port my code to the MIC?
Answer: probably. "Porting" is generally very easy; it involves little more than compiling with a new flag, and the only major issue is library availability.
But that's the wrong question. The real question is: will the MIC improve my code's performance?

13 How to achieve high performance
A high degree of parallelism is required:
Vectorization (local data parallelism)
Threading (multi-thread parallelism)
Hard work invested in optimization, but it is well worth the effort (see the sketch below)
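As a minimal sketch of what "vectorization plus threading" means in practice, here is a loop whose iterations are shared among OpenMP threads and whose unit-stride body the compiler can vectorize for the wide vector unit. The kernel and sizes are illustrative assumptions, not taken from the slides.

#include <stdio.h>
#include <omp.h>

#define N 1000000

/* The OpenMP pragma spreads iterations across threads; the simple
   unit-stride loop body is the kind of code the compiler can
   auto-vectorize for the 512-bit vector unit. */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N];
    int i;
    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f, x, y, N);
    printf("y[0] = %f, max threads = %d\n", y[0], omp_get_max_threads());
    return 0;
}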

14 MIC Strengths
Not just a coprocessor; more than just kernels: MPI, native codes
Familiar languages and programming models: C, C++, Fortran; OpenMP and MPI (and others)
Optimizing MIC code will benefit host code: a 2x improvement (new host versus old host) is common
Accelerator programmers have already done the hard part (encapsulating the offload)

15 MIC Issues and Questions
Current issues:
Amortizing the data transfers
Keeping MIC and host working together
File operations from the MIC
Software (especially library) availability
Memory limitations

16 Outline
Introduction
Native: MIC stand-alone
Offload: MIC as coprocessor
Symmetric: MPI

17 What is a native application?
An application built to run exclusively on the MIC
The MIC is not binary compatible with the host processor:
The instruction set is similar to Pentium, but not all 64-bit scalar extensions are included
The MIC has 512-bit vector extensions, but does NOT have MMX, SSE, or AVX
Native applications can't be used on the host CPU, and vice versa

18 Native: One MIC running autonomously
[Diagram: host processor and Intel® Xeon Phi™ coprocessor software stacks connected over the PCI-E bus / IB fabric; an ssh or telnet connection to the coprocessor's IP address gives a virtual terminal session in which the target-side "native" application (user code plus standard OS libraries and any 3rd-party or Intel libraries) runs on the coprocessor's Linux OS]
Use if: not serial, modest memory, complex code, no hot spots
Advantages: simpler model, no directives, easier port, good kernel test

19 Why run a native application?
It is possible to log in and run applications on the MIC without any host intervention
An easy way to get acquainted with the properties of the MIC
Performance studies: single-card scaling tests (OpenMP/MPI), with no issues of data exchange with the host
The code will probably perform quite well on the host CPU, too, once you build a host version
A good path toward symmetric runs

20 Will My Code Run on Xeon Phi?
Yes ... but that's the wrong question. The better questions: will your code run best on Phi? Will you get great Phi performance without additional work?

21 Building a native application
Xeon Phi is fully supported by the Intel C/C++ and Fortran compilers (v13+):
icc -openmp -mmic mysource.c -o myapp.mic
ifort -openmp -mmic mysource.f90 -o myapp.mic
The -mmic flag causes the compiler to generate a native coprocessor executable
Many/most optimization flags are the same for Xeon Phi
It is convenient to use a .mic extension to differentiate coprocessor executables
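For illustration, a minimal OpenMP program that could serve as the mysource.c in the commands above; its contents are an assumption (not given in the slides), but anything of this shape compiles unchanged for the host or, with -mmic, for the coprocessor.

#include <stdio.h>
#include <omp.h>

/* Prints how many OpenMP threads the runtime provides; built with -mmic
   and run on the coprocessor, this reflects the MIC's hardware threads. */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}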

22 Running a native application
1. Traditional ssh remote command execution:
ssh mic0 ls
2. Interactively log in to the MIC and use it as a normal Linux server:
ssh mic0
./myapp.mic
Either way, you must set up your own environment variables (libraries, paths, etc.)

23 Environmental Variables on the MIC
If you run directly on the card, use the regular names:
OMP_NUM_THREADS
KMP_AFFINITY
If you use the launcher, use the MIC_ prefix to define them on the host:
MIC_OMP_NUM_THREADS
MIC_KMP_AFFINITY
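A small illustration of the two cases; the thread count and affinity values below are example settings, not recommendations from the slides.

# Logged in on the coprocessor itself: regular names
export OMP_NUM_THREADS=120
export KMP_AFFINITY=compact
./myapp.mic

# From the host, when a launcher starts the MIC-side program: MIC_ prefix
export MIC_OMP_NUM_THREADS=120
export MIC_KMP_AFFINITY=compact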

24 Outline
Introduction
Native: MIC stand-alone
Offload: MIC as coprocessor
Symmetric: MPI

25 Offload: MIC as assistant processor
[Diagram: host processor and Intel® Xeon Phi™ coprocessor software stacks over the PCI-E bus; the host-side offload application's user code calls the offload libraries, user-level driver, and user-accessible APIs, which drive a target-side offload application against the coprocessor's offload libraries and APIs, all on their respective Linux OSes]
Advantages: more memory available, better file access, host is faster on serial code, better use of resources

26 Offload Models
Compiler Assisted Offload (CAO):
Explicit: the programmer explicitly directs data movement and code execution
Implicit: the programmer marks some data as "shared" in the virtual sense; the runtime automatically synchronizes values between host and MIC
Automatic Offload (AO):
Computationally intensive calls to the Intel Math Kernel Library (MKL); MKL automatically manages the details
More than offload: work division across host and MIC!

27 Explicit offload
F90 (the offload directive !dir$ offload target(mic) makes the statement that follows it run on the MIC; the rest runs on the CPU):
program main
use omp_lib
integer :: nprocs
!dir$ offload target(mic)
nprocs = omp_get_num_procs()
print*, "procs: ", nprocs
end program

C/C++ (the offload pragma #pragma offload target(mic) makes the statement that follows it run on the MIC; the rest runs on the CPU):
#include <stdio.h>
#include <omp.h>
int main( void ) {
int nprocs;
#pragma offload target(mic)
nprocs = omp_get_num_procs();
printf( "procs: %d\n", nprocs );
return 0;
}

28 Explicit offload
F90:
!dir$ offload begin target(mic)
nprocs = omp_get_num_procs()
Maxthreads = omp_get_max_threads()
!dir$ end offload

C/C++:
#pragma offload target(mic)
{
nprocs = omp_get_num_procs();
Maxthreads = omp_get_max_threads();
}

29 Explicit offload
F90:
program main
integer, parameter :: N = 500000
real :: a(N)
!dir$ offload target(mic)
!$omp parallel do
do i=1,N
a(i) = real(i)
end do
!$omp end parallel do
...

C/C++:
int main( void ) {
double a[500000];
int i;
#pragma offload target(mic)
#pragma omp parallel for
for ( i=0; i<500000; i++ ) {
a[i] = (double)i;
}
...

30 Explicit offload
F90 (the routine called on the MIC carries the offload attribute):
!dir$ attributes offload:mic :: successor
integer function successor( m )

program main
!dir$ attributes offload:mic :: successor
integer :: n
!dir$ offload target(mic)
n = successor( 123 )
...

C/C++ (the routine is declared with __declspec( target(mic) )):
__declspec( target(mic) ) int successor( int m );
int main( void ) {
int n;
#pragma offload target(mic)
{
n = successor( 123 );
}
...

31 Controlling the Offload
Additional decorations (clauses, attributes, specifiers, keywords) give the programmer a high degree of control over every step in the process:
Detect MIC(s)
Allocate/associate MIC memory
Transfer data to the MIC
Execute MIC-side code
Transfer data from the MIC
Deallocate MIC memory

32 Explicit offload
F90 (offload to a specific device, here mic:0):
!dir$ offload target(mic:0)
!$omp parallel do
do i=1,N
a(i) = real(i)
end do
!$omp end parallel do

C/C++ (offload to a specific device, here mic:0):
#pragma offload target(mic:0)
#pragma omp parallel for
for ( i=0; i<500000; i++ ) {
a[i] = (double)i;
}

33 Explicit offload
F90 (in/out/inout clauses control which arrays are copied in which direction):
integer, parameter :: N = 100000
real :: a(N), b(N), c(N), d(N)
!dir$ offload target(mic) in( a ), out( c, d ), inout( b )
!$omp parallel do
do i=1,N
c(i) = a(i) + b(i)
d(i) = a(i) - b(i)
b(i) = -b(i)
end do
!$omp end parallel do

C/C++:
double a[100000], b[100000], c[100000], d[100000];
#pragma offload target(mic) in( a ), out( c, d ), inout( b )
#pragma omp parallel for
for ( i=0; i<100000; i++ ) {
c[i] = a[i] + b[i];
d[i] = a[i] - b[i];
b[i] = -b[i];
}

34 Explicit offload
F90 (alloc_if/free_if control allocation and deallocation of the MIC-side buffers for allocatable data):
real, allocatable :: a(:), b(:)
integer, parameter :: N = ...
allocate( a(N), b(N) )
!dir$ offload target(mic) in( a : alloc_if(.true.) free_if(.true.) ), out( b : alloc_if(.true.) free_if(.false.) )
!$omp parallel do
do i=1,N
b(i) = 2.0 * a(i)
end do
!$omp end parallel do

C/C++ (pointers additionally need a length clause):
int N = ...;
double *a, *b;
a = ( double* ) memalign( 64, N*sizeof(double) );
b = ( double* ) memalign( 64, N*sizeof(double) );
#pragma offload target(mic) in( a : length(N) alloc_if(1) free_if(1) ), out( b : length(N) alloc_if(1) free_if(0) )
#pragma omp parallel for
for ( i=0; i<N; i++ ) {
b[i] = 2.0 * a[i];
}

35 Explicit offload
F90 (the signal clause starts the offload asynchronously; the host continues past it):
integer :: n = 123
integer :: tag = 1
!dir$ offload begin target(mic:0) signal( tag )
call incrementslowly( n )
!dir$ end offload
print *, " n: ", n

C/C++:
int n = 123;
int tag = 1;
#pragma offload target(mic:0) signal( &tag )
incrementSlowly( &n );
printf( "\n\tn = %d \n", n );

36 Explicit offload
F90 (offload_wait blocks until the signaled offload completes):
integer :: n = 123
!dir$ offload begin target(mic:0) signal( n )
call incrementslowly( n )
!dir$ end offload
!dir$ offload_wait target(mic:0) wait( n )
print *, " n: ", n

C/C++:
int n = 123;
#pragma offload target(mic:0) signal( &n )
incrementSlowly( &n );
#pragma offload_wait target(mic:0) wait( &n )
printf( "\n\tn = %d \n", n );

37 Explicit offload
F90 (a second offload carrying a wait clause runs only after the signaled offload finishes):
integer :: n = 123
!dir$ offload begin target(mic:0) signal( n )
call incrementslowly( n )
!dir$ end offload
!dir$ offload begin target(mic:0) wait( n )
print *, " procs: ", omp_get_num_procs()
call flush(0)
!dir$ end offload
print *, " n: ", n

C/C++:
int n = 123;
#pragma offload target(mic:0) signal( &n )
incrementSlowly( &n );
#pragma offload target(mic:0) wait( &n )
{
printf( "\n\tprocs: %d\n", omp_get_num_procs() );
fflush(0);
}
printf( "\n\tn = %d \n", n );

38 Explicit offload
F90 (offload_transfer moves data without executing code, e.g. to keep persistent buffers on the MIC):
!dir$ offload_transfer target(mic:0) in( a : alloc_if(.true.) free_if(.false.) ), nocopy( b : alloc_if(.true.) free_if(.false.) ) signal( tag1 )
!dir$ offload_wait target(mic:0) wait( tag1 )

C/C++:
#pragma offload_transfer target(mic:0) in( a : length(N) alloc_if(1) free_if(0) ), nocopy( b : length(N) alloc_if(1) free_if(0) ) signal( &tag1 )
#pragma offload_wait target(mic:0) wait( &tag1 )

39 Other Key Environment Variables
OMP_NUM_THREADS: the default is 1; that's probably not what you want!
MIC_OMP_NUM_THREADS: the default behavior (variable undefined) is 244 threads; you definitely don't want that
MIC_KMP_AFFINITY: controls thread placement on the coprocessor

40 Offload: making it worthwhile
Enough computation to justify the data movement
High degree of parallelism: threading, vectorization
Work division: keep host and MIC both active, using asynchronous directives and offloads from OpenMP regions (see the sketch below)
Intelligent data management and alignment: keep persistent data on the MIC when possible
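A minimal sketch of the "keep host and MIC both active" idea, under assumptions: the offload is started with a signal clause, the host does its own share of the work, and an offload_wait collects the MIC result only when it is needed. The routines do_mic_share and do_host_share are hypothetical placeholders, not from the slides.

#include <stdio.h>

#define N 1000

/* Hypothetical work routines; the MIC-side one must be compiled for the
   coprocessor, hence the target(mic) attribute. */
__declspec( target(mic) ) void do_mic_share( double *buf, int n )
{
    int i;
    for ( i = 0; i < n; i++ ) buf[i] = 2.0 * i;
}

void do_host_share( double *buf, int n )
{
    int i;
    for ( i = 0; i < n; i++ ) buf[i] = 3.0 * i;
}

int main( void )
{
    double mic_buf[N], host_buf[N];
    int sig;

    /* Start the MIC's share asynchronously ... */
    #pragma offload target(mic:0) signal( &sig ) inout( mic_buf )
    do_mic_share( mic_buf, N );

    /* ... the host works on its own share in the meantime ... */
    do_host_share( host_buf, N );

    /* ... and block only when the MIC result is actually needed. */
    #pragma offload_wait target(mic:0) wait( &sig )

    printf( "mic_buf[1] = %f, host_buf[1] = %f\n", mic_buf[1], host_buf[1] );
    return 0;
}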

41 Automatic Offload (AO)
Pre-packaged offload in the Intel Math Kernel Library (MKL)
Computationally intensive linear algebra, accomplished by calls to MKL
MKL automatically manages the details
More than offload: work division across host and MIC!

42 Automatic Offload (AO)
Growing list of supported functions (think large, dense matrix-matrix ops): xGEMM and variants; also LU, QR, Cholesky
Kicks in at appropriate size thresholds (e.g. SGEMM: (M,N,K) = (2048, 2048, 256))
Essentially no programmer action required, but there are straightforward mechanisms for tuning and controlling the action

43 Automatic Offload
F90:
...
call dgemm( 'N', 'N', size, size, size, 1.0d0, a, size, b, size, 0.0d0, c, size )

C/C++:
#include "mkl.h"
...
cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, size, size, size, 1.0, A, size, B, size, 0.0, C, size );
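A self-contained sketch of how the C call above might appear in a complete program. It assumes MKL is linked (e.g. with -mkl) and that Automatic Offload is requested either by exporting MKL_MIC_ENABLE=1 before the run or, as shown here, by calling mkl_mic_enable(); the matrix size is an arbitrary illustrative value large enough to pass the AO threshold.

#include <stdio.h>
#include "mkl.h"

int main( void )
{
    int size = 4096;
    long i, nelem = (long)size * size;
    double *A = (double*) mkl_malloc( nelem * sizeof(double), 64 );
    double *B = (double*) mkl_malloc( nelem * sizeof(double), 64 );
    double *C = (double*) mkl_malloc( nelem * sizeof(double), 64 );

    for ( i = 0; i < nelem; i++ ) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* Request Automatic Offload from within the program; exporting
       MKL_MIC_ENABLE=1 before launch is assumed to be equivalent. */
    mkl_mic_enable();

    cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 size, size, size, 1.0, A, size, B, size, 0.0, C, size );

    printf( "C[0] = %f\n", C[0] );
    mkl_free( A ); mkl_free( B ); mkl_free( C );
    return 0;
}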

44 Automatic Offload
Set at least three environment variables before launching your code:
export MKL_MIC_ENABLE=1
export OMP_NUM_THREADS=16
export MIC_OMP_NUM_THREADS=240
Other environment variables provide additional fine-grained control over host-MIC work division, etc.

45 Implicit Offload: Virtual Shared Memory
Also known as Virtual Shared Memory or MYO ("Mine, Yours, Ours")
The programmer marks data as shared between host and MIC; the runtime manages synchronization
Supports "arbitrarily complex" data structures, including objects and their methods
Available only for C/C++
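To make the model concrete, a small hedged sketch of how the implicit model is typically written with the Intel compiler's _Cilk_shared and _Cilk_offload keywords; the variable and function names are illustrative, and the exact keywords should be checked against the compiler version in use.

#include <stdio.h>

/* Data and code marked _Cilk_shared exist on both host and MIC;
   the runtime keeps the shared values synchronized around offloads. */
_Cilk_shared int counter = 0;

_Cilk_shared void bump( int by )
{
    counter += by;   /* runs on whichever side the call is offloaded to */
}

int main( void )
{
    _Cilk_offload bump( 5 );               /* execute bump() on the coprocessor */
    printf( "counter = %d\n", counter );   /* host sees the synchronized value */
    return 0;
}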

46 Outline
Introduction
Native: MIC stand-alone
Offload: MIC as coprocessor
Symmetric: MPI

47 Symmetric Computing Run MPI tasks on both MIC and host
Also called "heterogeneous computing"
Two executables are required: one for the CPU, one for the MIC
Works with Intel MPI

48 Symmetric: MPI Tasks Across Multiple Devices
[Diagram: host-side MPI application and target-side MPI application, each consisting of user code plus the standard OS libraries and any 3rd-party or Intel libraries, running on their respective Linux OSes and communicating through the Intel MPI support libraries over the PCI-E bus / IB fabric]

49 Steps to create a symmetric run
1. Compile a host executable and a MIC executable:
$ mpicc -openmp -o host symmetric.c
$ mpicc -openmp -mmic -o device symmetric.c
2. Determine the appropriate number of tasks and threads for both MIC and host:
2 tasks/host, 1 thread/MPI task
2 tasks/MIC, 30 threads/MPI task
3.1 Launch with mpirun:
export OMP_NUM_THREADS=1
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=30
export I_MPI_MIC=1
$ mpirun -np 2 -host localhost ./host : -np 2 -host mic0 ./device : -np 2 -host mic1 ./device
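For illustration, a minimal MPI+OpenMP program that could serve as the symmetric.c compiled above; its contents are an assumption (not given in the slides), but any code of this shape works for both the host and the MIC builds.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Each MPI task reports where it runs and how many OpenMP threads it has,
   which makes it easy to verify a symmetric host+MIC launch. */
int main( int argc, char *argv[] )
{
    int rank, nranks, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &nranks );
    MPI_Get_processor_name( name, &namelen );

    printf( "rank %d of %d on %s with %d threads\n",
            rank, nranks, name, omp_get_max_threads() );

    MPI_Finalize();
    return 0;
}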

50 Steps to create a symmetric run
3.2 Launch through a wrapper shell script (a sketch of such a wrapper follows below):
export OMP_NUM_THREADS=1
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=30
export I_MPI_MIC=1
mpirun -np 4 -ppn 2 -f hostfile ./wrapper.sh ./symmetric
3.3 Launch using I_MPI_MIC_POSTFIX:
export I_MPI_MIC_POSTFIX=.mic
mpirun -np 4 -ppn 2 -f hostfile ./symmetric
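The contents of wrapper.sh are not shown in the slides; below is a minimal sketch of what such a wrapper might look like, assuming the coprocessor reports the k1om architecture from uname and that the MIC build of the executable carries a .mic suffix.

#!/bin/sh
# Hypothetical wrapper.sh: launched by mpirun on every node, it picks the
# binary that matches the architecture it finds itself on.
exe="$1"
shift
if uname -m | grep -q k1om; then
    exec "$exe.mic" "$@"    # on the coprocessor: run the MIC build
else
    exec "$exe" "$@"        # on the host: run the host build
fi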

51 https://software.intel.com/en-us

