Martin Kruliš by Martin Kruliš (v1.1)1
Architecture Revision
◦ Many simpler (Pentium) cores
◦ Each equipped with a powerful 512-bit vector engine
Threading Issues
◦ Xeon Phi has ~60 cores, each handling 4 threads
◦ The core has two pipes, each processing one instruction per cycle
    Limitations on the instruction types and bundling
    At least 2 logical cores (hardware threads) per physical core must be occupied to achieve optimal throughput (depending on the code)
◦ The threads are scheduled in a round-robin manner
    If one thread gets stalled, others may fill in
◦ Core uses an in-order 7-stage pipeline
    Most int and mask instructions have 1-clock latency
Memory
◦ 8 memory controllers, each with 2 32-bit channels operating at 5.5 GT/s
    8 controllers × 2 channels × 4 B × 5.5 GT/s = 352 GB/s bandwidth
Caches
◦ 2x 32kB L1, 512kB L2 coherent cache per core
    Common 64B cache lines
◦ MESI protocol + distributed tag directory
    TD is used to replace the missing “ownership” state
◦ Cache coherency may be expensive since the core ring interface may take a while to deliver a cache line
    Beware of cache hotspots and false sharing
Data alignment!
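The false-sharing warning above can be illustrated with a minimal sketch in plain C11, assuming the 64B cache line size stated on the slide. The names `CACHE_LINE`, `padded_counter_t`, and `sum_counters` are hypothetical and chosen only for illustration. Each per-thread counter is padded to a full cache line, so two threads updating neighbouring counters never write to the same line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* cache line size in bytes (assumed from the slide) */

/* One counter per thread, aligned and padded to occupy a whole cache line. */
typedef struct {
    _Alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

/* Compile-time check that the padding really fills exactly one line. */
static_assert(sizeof(padded_counter_t) == CACHE_LINE,
              "padded_counter_t must span exactly one cache line");

/* Sum the per-thread counters once all threads are done (serial reduction). */
long sum_counters(const padded_counter_t *counters, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += counters[i].value;
    return total;
}
```

The cost of this layout is memory: every counter takes 64 bytes instead of `sizeof(long)`, which is usually acceptable for a handful of per-thread variables.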
Vector Units
◦ Completely new architecture and instruction set
    No MMX, SSE, or AVX
◦ Long 512-bit registers
    For 16 floats or 8 doubles
    32 vector registers, 8 mask registers
◦ Instructions have 4-clock latency, 1-clock throughput
◦ Rich instruction set
    Scatter/gather, shuffle, and swizzling
    Masked (conditional) execution
    Ternary instructions (two sources, one dest)
Data alignment!
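To make masked (conditional) execution more concrete, here is a scalar model in plain C. It is only a sketch of the semantics, not actual vector code: the function name `masked_add16` and the constant `LANES` are hypothetical, and the 16 lanes are assumed from the 512-bit register holding 16 floats. Each bit of a 16-bit mask decides whether the corresponding lane is written or keeps its previous value.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* 512-bit register / 32-bit float (assumed from the slide) */

/* Scalar model of a masked vector add:
 * lanes whose mask bit is set receive a[lane] + b[lane],
 * all other lanes of dst are left untouched. */
void masked_add16(float dst[LANES], const float a[LANES],
                  const float b[LANES], uint16_t mask)
{
    assert(dst != 0 && a != 0 && b != 0);
    for (int lane = 0; lane < LANES; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
    }
}
```

On the real hardware the whole loop would correspond to a single masked instruction, which is why branches inside vectorized loops are typically converted into masks.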
Symmetric Communication Interface
◦ Low-level socket-like interface on top of PCIe
    Supports both message passing and memory mapping
    Peer-to-peer reliable in-order communication
◦ SCIF Node - physical endpoint (host, device, …)
    Node ID ~ like IP address (host always has ID = 0)
◦ SCIF Port - logical destination on a SCIF node
    SCIF Port ~ like TCP/UDP port
◦ SCIF Endpoint - represents a connection (like a socket)
    Can be either listening or connected
    Used as a handle for any communication
Peer-to-peer Communication Topology
[Figure: peer-to-peer communication topology. Labels: multiple endpoints/connections may be created; loopback]
SCIF API
◦ Initialization
    scif_open(), scif_bind()
◦ Listening (server)
    scif_listen(), scif_accept()
◦ Connecting (client)
    scif_connect()
◦ Communication
    scif_send(), scif_recv()
◦ Termination
    scif_close()
Memory Transfers
◦ Any memory needs to be registered first
    scif_register(), scif_unregister()
    Registered address space is a separate space that keeps mappings to physical memory (called windows)
    Each endpoint has its own registered address space
    Registered windows of a remote endpoint can be accessed
◦ Memory operations
    Explicit RMA transfers: scif_readfrom(), scif_writeto(), …
    Memory mapping: scif_mmap(), scif_munmap()
[Figure: registered windows; a window mapped in the VA of a process]
Mapping to physical memory remains even after the VA is unmapped
Twofold Nature of the Xeon Phi
◦ The cores are based on an old architecture, but there are a lot of them and they have powerful VPUs
Wrestling the Compiler
◦ The compiler attempts to vectorize automatically
    Often needs a little help or explicit code modifications
    There are a lot of invariants that may hold, but the programmer does not express them in the code
◦ The programmer may use vector intrinsics or libs
◦ Similar techniques are used in a serial environment
Automated Compiler Vectorization
    float *x; float *y;   /* allocated arrays of length N */
    for (size_t i = 0; i < N; ++i) {
        y[i] = x[i] * x[i];
    }
Loop unrolling
    y[i]   = x[i]   * x[i]
    y[i+1] = x[i+1] * x[i+1]
    y[i+2] = x[i+2] * x[i+2]
    ...
Subsequent values may be stored in a vector register and computed concurrently
The operations must be independent
Border cases must be resolved
The data should be aligned to the vector register size
The compiler may not assume that x and y point to different memory blocks
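One way to express the missing invariant (x and y do not overlap) is the standard C99 `restrict` qualifier. The following is a minimal sketch, not the deck's own code; the function name `square_all` is hypothetical. With `restrict`, the programmer promises that the two arrays are distinct, which removes the aliasing obstacle to vectorization. Whether the loop is actually vectorized still depends on the compiler and its switches.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises the compiler that x and y never overlap,
 * so iterations are independent and may be computed concurrently. */
void square_all(size_t n, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] * x[i];
}
```

Calling `square_all` with overlapping arrays violates the promise and results in undefined behaviour, so the responsibility for correctness moves from the compiler to the caller.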
(Intel) Compiler
◦ Vectorization is affected by optimization switches
    -O2, #pragma optimize
    -qopt-report= -qopt-report-phase=vec
    Get some info about compiler vectorization efforts
◦ Intel-specific optimization pragmas
    #pragma unroll(n), #pragma nounroll
    #pragma loop_count
    #pragma ivdep
    #pragma simd, #pragma vector
    #pragma vector nontemporal - streaming writes to memory
    Vectorization may break correctness (handle with care)
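As an illustration of `#pragma ivdep`, consider a loop where the compiler cannot prove the absence of a loop-carried dependence because the offset is known only at run time. This is a minimal sketch; the function name `scale_shifted` is hypothetical, and the pragma is guarded by the `__INTEL_COMPILER` macro so that other compilers simply skip it. The pragma tells the compiler to ignore assumed dependences, which is safe here only because the caller guarantees `k >= 0`.

```c
#include <assert.h>

/* a[i] reads a[i + k]. For k >= 0 every element is read before it is
 * overwritten, so vectorization is safe; for k < 0 it would not be. */
void scale_shifted(int *a, int k, int c, int m)
{
    assert(k >= 0);  /* the invariant the compiler cannot see */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (int i = 0; i < m; ++i)
        a[i] = a[i + k] * c;
}
```

This also shows why such pragmas need care: if the invariant does not hold, the vectorized code may silently produce different results than the serial loop.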
Explicit Vectorization
◦ Inserting instructions in assembler/intrinsics
◦ Cilk Plus
    __declspec(vector(uniform(b,c),linear(i:1)))
    float foo(float *b, float *c, int i) { return b[i] + c[i]; }
    for (i=0; i<N; ++i) a[i] = foo(b, c, i);
◦ Cilk Plus Array Notation
    for (i=0; i<N; i += veclen)
        c[i:veclen] = a[i:veclen] + b[i:veclen];
◦ Problem-specific libraries (e.g., MKL)
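The Cilk Plus vector function above requires compiler support for Cilk Plus. As a portable alternative (a swapped-in technique, not the one from the slide), the same intent can be sketched with OpenMP SIMD directives, where `declare simd` with `uniform` and `linear` clauses plays the role of `__declspec(vector(...))`. The names `add_at` and `add_arrays` are hypothetical. The pragmas are guarded by `_OPENMP`, so the code compiles as an ordinary serial loop when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* b and c are the same for all lanes (uniform); i grows by 1 per lane (linear). */
#ifdef _OPENMP
#pragma omp declare simd uniform(b, c) linear(i : 1)
#endif
static float add_at(const float *b, const float *c, int i)
{
    return b[i] + c[i];
}

/* Element-wise a = b + c, written so the call can be vectorized. */
void add_arrays(float *a, const float *b, const float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL);
#ifdef _OPENMP
#pragma omp simd
#endif
    for (int i = 0; i < n; ++i)
        a[i] = add_at(b, c, i);
}
```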
Other optimizations
◦ Alignment
    __declspec(align(64)) float a[N];
    Aligned malloc, aligned STL allocators, …
◦ Avoid manual loop unrolling
    Breaks automated unrolling
◦ Memory prefetching
    #pragma prefetch, #pragma noprefetch
    _mm_prefetch(data, hint)
◦ Large 2MB memory pages to reduce TLB misses
    mmap() + MAP_HUGETLB instead of malloc()
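For dynamically allocated data, the "aligned malloc" item can be sketched with the standard C11 `aligned_alloc()`. This is an illustrative example only; the helper name `alloc_aligned_floats` and the constant `VECTOR_ALIGN` are hypothetical, and the 64-byte alignment is an assumption derived from the 512-bit vector register size. Since `aligned_alloc()` expects the size to be a multiple of the alignment, the helper rounds the size up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VECTOR_ALIGN 64  /* bytes; 512-bit vector register (assumption) */

/* Allocate `count` floats aligned to VECTOR_ALIGN; release with free(). */
float *alloc_aligned_floats(size_t count)
{
    assert(count <= (SIZE_MAX - VECTOR_ALIGN) / sizeof(float));  /* no overflow */
    size_t bytes = count * sizeof(float);
    /* Round up to a multiple of the alignment, at least one full block. */
    size_t rounded = (bytes + VECTOR_ALIGN - 1) / VECTOR_ALIGN * VECTOR_ALIGN;
    if (rounded == 0)
        rounded = VECTOR_ALIGN;
    return aligned_alloc(VECTOR_ALIGN, rounded);
}
```

The returned pointer may be NULL if the allocation fails, so the caller should check it just as with `malloc()`.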
MPSS
◦ The service must run on the host to operate the Xeon Phi
◦ It handles card management, bootstrap, network communication, offloads, …
◦ Contains command line tools to manage the Xeon Phi
    micctrl, micinfo, micsmc, micflash, …
Device Operating System
◦ Customized and pruned Linux
    /opt/mpss/ /sysroots - core
    /opt/intel/mic/filesystem/micID - customization
    /var/mpss/micID - customization
Bootstrap
◦ fboot0
    Hardwired (ROM) code that starts first
    Authenticates fboot1 and hands control to it
◦ fboot1
    Code stored in flash memory (can be modified)
    Initializes HW (CPUs, memory, …)
    Downloads the coprocessor OS from the host
    Authenticates the OS
        If it passes, the OS is started in maintenance mode
        Otherwise it boots the OS in “regular” (3rd party) mode
    The OS is booted using the Linux boot protocol