Martin Kruliš 7. 1. 2016 by Martin Kruliš (v1.1)


• Architecture Revision
  ◦ Many simpler (Pentium) cores
  ◦ Each equipped with a powerful 512-bit vector engine

• Threading Issues
  ◦ Xeon Phi has ~60 cores, each handling 4 threads
  ◦ Each core has two pipes, each processing one instruction per cycle
    ▪ Limitations on the instruction types and bundling
    ▪ At least 2 logical cores must be occupied to achieve optimal throughput (depending on the code)
  ◦ The threads are scheduled in a round-robin manner
    ▪ If one thread gets stalled, others may fill in
  ◦ The core uses an in-order 7-stage pipeline
    ▪ Most int and mask instructions have 1-clock latency

• Memory
  ◦ 8 memory controllers with two 32-bit channels each, operating at 5.5 GT/s: 8 × 2 × 4 B × 5.5 GT/s = 352 GB/s bandwidth
• Caches
  ◦ 2× 32 kB L1, 512 kB L2 coherent cache per core
    ▪ Common 64 B cache lines
  ◦ MESI protocol + distributed tag directory (TD)
    ▪ TD is used to replace the missing “ownership” state
  ◦ Cache coherency may be expensive since the core ring interface may take a while to deliver a cache line
    ▪ Beware of cache hotspots and false sharing
Data Alignment!
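The false-sharing warning can be made concrete with a small layout sketch. It assumes the 64 B cache line stated above; the struct and helper names are hypothetical and only illustrate how padding places each per-thread counter on its own line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* cache line size taken from the slide */

/* Unpadded: several adjacent counters share one 64 B line, so threads
   updating neighbouring counters keep invalidating each other's line. */
struct packed_counter { long value; };

/* Padded: the alignment requirement rounds the struct size up to 64 B,
   so every array element occupies a cache line of its own. */
struct padded_counter { _Alignas(CACHE_LINE) long value; };

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "one counter per cache line");

/* Hypothetical helpers: byte distance between consecutive array slots. */
size_t packed_stride(void) { return sizeof(struct packed_counter); }
size_t padded_stride(void) { return sizeof(struct padded_counter); }
```

An array of `struct padded_counter` indexed by thread ID trades memory for the guarantee that no two threads write to the same line.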

• Vector Units
  ◦ Completely new architecture and instruction set
    ▪ No MMX, SSE, or AVX
  ◦ Long 512-bit registers
    ▪ For 16 floats or 8 doubles
    ▪ 32 vector registers, 8 mask registers
  ◦ Instructions have 4-clock latency, 1-clock throughput
  ◦ Rich instruction set
    ▪ Scatter/gather, shuffle, and swizzling
    ▪ Masked (conditional) execution
    ▪ Ternary instructions (two sources, one dest)
Data Alignment!
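Masked execution may be easier to picture with a scalar emulation. The sketch below is plain C, not the actual vector instruction set: it assumes 16 float lanes per 512-bit register and one mask bit per lane, and the function name is hypothetical.

```c
#include <stdint.h>

#define LANES 16  /* 512 bits / 32-bit floats */

/* Scalar emulation of a masked vector add: a lane is computed only when
   its mask bit is set; lanes with a cleared bit keep their old value. */
void masked_add(float dst[LANES], const float a[LANES],
                const float b[LANES], uint16_t mask) {
    for (int lane = 0; lane < LANES; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] + b[lane];
        }
    }
}
```

On the hardware the whole loop corresponds to a single instruction, which is how a vectorized `if` inside a loop body can be expressed without branching.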

• Symmetric Communication Interface (SCIF)
  ◦ Low-level socket-like interface on top of PCIe
    ▪ Supports both message passing and memory mapping
    ▪ Peer-to-peer reliable in-order communication
  ◦ SCIF Node – physical endpoint (host, device, …)
    ▪ Node ID ~ like an IP address (the host always has ID = 0)
  ◦ SCIF Port – logical destination on a SCIF node
    ▪ SCIF Port ~ like a TCP/UDP port
  ◦ SCIF Endpoint – represents a connection (like a socket)
    ▪ Can be either listening or connected
    ▪ Used as a handle for any communication


• Peer-to-peer Communication Topology
  [Figure labels: "Multiple endpoints/connections may be created", "Loopback"]

• SCIF API
  ◦ Initialization: scif_open(), scif_bind()
  ◦ Listening (server): scif_listen(), scif_accept()
  ◦ Connecting (client): scif_connect()
  ◦ Communication: scif_send(), scif_recv()
  ◦ Termination: scif_close()

• Memory Transfers
  ◦ Any memory needs to be registered first
    ▪ scif_register(), scif_unregister()
    ▪ Registered address space is a separate space that keeps mappings to physical memory (called windows)
    ▪ Each endpoint has its own registered address space
    ▪ Registered windows of a remote endpoint can be accessed
  ◦ Memory operations
    ▪ Explicit RMA transfers: scif_readfrom(), scif_writeto(), …
    ▪ Memory mapping: scif_mmap(), scif_munmap()

[Figure: registered windows. Labels: "Window", "Window mapped in VA of a process", "Mapping to physical memory remains even after VA is unmapped"]

• Twofold Nature of the Xeon Phi
  ◦ The cores are based on an old architecture, but there are a lot of them and they have powerful VPUs
• Wrestling the Compiler
  ◦ The compiler attempts to vectorize automatically
    ▪ Often needs a little help or explicit code modifications
    ▪ There are a lot of invariants that may hold, but the programmer does not express them in the code
  ◦ The programmer may use vector intrinsics or libs
  ◦ Similar techniques are used in a serial environment

• Automated Compiler Vectorization
    float *x; float *y;   // allocated arrays of length N
    for (size_t i = 0; i < N; ++i) {
      y[i] = x[i] * x[i];
    }
  ◦ Loop unrolling
      y[i]   = x[i]   * x[i]
      y[i+1] = x[i+1] * x[i+1]
      y[i+2] = x[i+2] * x[i+2]
      ...
  ◦ Subsequent values may be stored in a vector register and computed concurrently
  ◦ The operations must be independent
  ◦ Border cases must be resolved
  ◦ The data should be aligned to the vector register size
  ◦ The compiler cannot assume that x and y point to different memory blocks
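One invariant from this slide, that x and y never overlap, can be stated directly in C99 with `restrict`. This is a minimal sketch; the function name is hypothetical, and whether a given compiler then vectorizes the loop still depends on its switches.

```c
#include <stddef.h>

/* restrict is the programmer's promise that x and y do not alias, so the
   iterations are independent and may be computed several at a time. */
void square_all(float *restrict y, const float *restrict x, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}
```

Calling the function with overlapping arrays breaks the promise and makes the behaviour undefined, so the guarantee has to hold at every call site.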

• (Intel) Compiler
  ◦ Vectorization is affected by optimization switches
    ▪ -O2, #pragma optimize
    ▪ -qopt-report= -qopt-report-phase=vec
    ▪ Get some info about compiler vectorization efforts
  ◦ Intel-specific optimization pragmas
    ▪ #pragma unroll(n), #pragma nounroll
    ▪ #pragma loop_count
    ▪ #pragma ivdep
    ▪ #pragma simd, #pragma vector
    ▪ #pragma vector nontemporal: streaming writes to memory
    ▪ Vectorization may break correctness (handle with care)
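A short sketch of where `#pragma ivdep` helps and why it needs care. The compiler cannot tell whether `k` is negative, in which case later iterations would read values written by earlier ones; the pragma tells it to ignore that assumed dependence. The function name is hypothetical, and the GCC branch uses `#pragma GCC ivdep`, GCC's counterpart of the Intel pragma.

```c
#include <assert.h>

/* Multiplies a[i + k] by c and stores it to a[i] for i in [0, m).
   Safe to vectorize only when k >= 0 (reads run ahead of writes). */
void scale_shifted(int *a, int k, int c, int m) {
    assert(k >= 0);  /* the invariant the compiler cannot prove itself */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (int i = 0; i < m; ++i) {
        a[i] = a[i + k] * c;
    }
}
```

If the function were called with a negative `k`, the pragma would still permit vectorization and the result could differ from the scalar loop, which is the correctness risk the slide warns about.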

• Explicit Vectorization
  ◦ Inserting instructions in assembler/intrinsics
  ◦ Cilk Plus
      __declspec(vector(uniform(b,c),linear(i:1)))
      float foo(float *b, float *c, int i) { return b[i] + c[i]; }
      for (i=0; i<N; ++i) a[i] = foo(b, c, i);
  ◦ Cilk Plus Array Notation
      for (i=0; i<N; i += veclen)
        c[i:veclen] = a[i:veclen] + b[i:veclen];
  ◦ Problem-specific libraries (e.g., MKL)
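For compilers without Cilk Plus, roughly the same intent can be written with OpenMP 4.0 SIMD directives. This is a swapped-in technique rather than the one on the slide, shown as a sketch: `foo` mirrors the slide's function, `add_arrays` is a hypothetical wrapper, and without an OpenMP switch (for example `-fopenmp-simd` in GCC) the pragmas are ignored and the code runs as ordinary scalar C.

```c
/* b and c are the same for all lanes (uniform); i grows by 1 per lane
   (linear), so the compiler may generate a vector variant of foo. */
#pragma omp declare simd uniform(b, c) linear(i:1)
float foo(const float *b, const float *c, int i) {
    return b[i] + c[i];
}

/* Hypothetical wrapper: asks for the loop to be vectorized explicitly. */
void add_arrays(float *a, const float *b, const float *c, int n) {
#pragma omp simd
    for (int i = 0; i < n; ++i) {
        a[i] = foo(b, c, i);
    }
}
```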

• Other optimizations
  ◦ Alignment
    ▪ __declspec(align(16)) float a[N]; (use align(64) to match the 512-bit vector registers)
    ▪ Aligned malloc, aligned STL allocators, …
  ◦ Avoid manual loop unrolling
    ▪ Breaks automated unrolling
  ◦ Memory prefetching
    ▪ #pragma prefetch, #pragma noprefetch
    ▪ _mm_prefetch(data, hint)
  ◦ Large 2 MB memory pages to reduce TLB misses
    ▪ mmap() + MAP_HUGETLB instead of malloc()
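For heap data, the "aligned malloc" bullet can be sketched with C11 `aligned_alloc`. The 64 B value assumes the 512-bit registers described earlier, and both helper names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define VECTOR_BYTES 64  /* one 512-bit vector register */

/* Allocates room for n floats starting on a 64 B boundary.
   Returns NULL on failure; release the memory with free(). */
float *alloc_aligned_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + VECTOR_BYTES - 1) / VECTOR_BYTES * VECTOR_BYTES;
    if (rounded == 0) {
        rounded = VECTOR_BYTES;
    }
    return aligned_alloc(VECTOR_BYTES, rounded);
}

/* Checks whether a pointer sits on a vector-register boundary. */
int is_vector_aligned(const void *p) {
    return ((uintptr_t)p % VECTOR_BYTES) == 0;
}
```

Rounding the size up also leaves the tail of the array padded to a full vector, which can simplify handling of the border cases mentioned on the vectorization slide.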

• MPSS (Manycore Platform Software Stack)
  ◦ The service must run on the host to operate the Xeon Phi
  ◦ It handles card management, bootstrap, network communication, offloads, …
  ◦ Contains command-line tools to manage the Xeon Phi
    ▪ micctrl, micinfo, micsmc, micflash, …
• Device Operating System
  ◦ Customized and pruned Linux
    ▪ /opt/mpss/ /sysroots – core
    ▪ /opt/intel/mic/filesystem/micID – customization
    ▪ /var/mpss/micID – customization

• Bootstrap
  ◦ fboot0
    ▪ Hardwired (ROM) code that starts first
    ▪ Authenticates fboot1 and hands control to it
  ◦ fboot1
    ▪ Code stored in flash memory (can be modified)
    ▪ Initializes HW (CPUs, memory, …)
    ▪ Downloads the coprocessor OS from the host
    ▪ Authenticates the OS
    ▪ If it passes, the OS is started in maintenance mode
    ▪ Otherwise it boots the OS in “regular” (3rd party) mode
    ▪ The OS is booted using the Linux boot protocol
