Intel MIC Architecture Internals and Optimizations

Martin Kruliš (v1.2)

Intel MIC Architecture
Architecture Revision
- Many simpler (Pentium-based) cores
- Each equipped with a powerful 512-bit vector engine

MIC Architecture - KNC
Threading Issues
- Xeon Phi has ~60 cores, each handling 4 threads
- The core has two pipes, each processing one instruction per cycle
  - Limitations on the instruction types and bundling
- At least 2 of the 4 logical cores (hardware threads) of a core must be occupied to achieve optimal throughput (depending on the code)
- The threads are scheduled in a round-robin manner
  - If one thread gets stalled, others may fill in
- The core uses an in-order 7-stage pipeline
  - Most int and mask instructions have 1-clock latency

MIC Architecture - KNC
Memory
- 8 memory controllers, two 32-bit channels each, operating at 5.5 GT/s
  - 8 x 2 x 4 B x 5.5 GT/s = 352 GB/s bandwidth
Caches
- 2x 32 kB L1 and 512 kB L2 coherent cache per core
- Common 64 B cache lines
- MESI protocol + distributed tag directory (TD)
  - The TD is used to compensate for the missing "ownership" state
- Cache coherency may be expensive, since the core ring interface may take a while to deliver a cache line
  - Beware of cache hotspots and false sharing
- Data alignment!
- Access times: L1: 1 cycle, L2: 11 cycles
  - L1 read-after-write latency: 11 cycles
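The warning about false sharing can be illustrated with a minimal sketch. It assumes the 64 B cache line stated above; the type `padded_counter_t` and the function `sum_counters` are hypothetical names introduced only for this illustration. Aligning each per-thread counter to a full cache line keeps two threads from ever writing to the same line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* cache line size in bytes, as stated on the slide */

/* Hypothetical per-thread counter. The alignment requirement also pads the
   struct, so neighbouring counters in an array never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) long value;
} padded_counter_t;

/* sizeof is rounded up to a multiple of the alignment. */
static_assert(sizeof(padded_counter_t) == CACHE_LINE,
              "one counter per cache line");

/* Sum the per-thread counters once all threads have finished. */
long sum_counters(const padded_counter_t *counters, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i)
        total += counters[i].value;
    return total;
}
```

Each thread would update only `counters[thread_id].value`; the price is 64 B of memory per counter instead of 8 B.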

MIC Architecture - KNC
Vector Units
- Completely new architecture and instruction set
  - No MMX, SSE, or AVX
- Long 512-bit registers
  - For 16 floats or 8 doubles
  - 32 vector registers, 8 mask registers
- Instructions have 4-clock latency, 1-clock throughput
- Rich instruction set
  - Scatter/gather, shuffle, and swizzling
  - Masked (conditional) execution
  - Ternary instructions (two sources, one dest)
- Data alignment!
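Masked (conditional) execution may be easier to picture with a scalar emulation. The sketch below is not the actual instruction set; `masked_add` and `compare_lt` are hypothetical helpers that mimic what a 16-lane masked add and a lane-wise comparison would do, with one mask bit per float lane.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* a 512-bit register holds 16 floats */

static_assert(LANES <= 16, "uint16_t mask holds one bit per lane");

/* Scalar emulation of a masked vector add: lane l of dst is overwritten
   with a[l] + b[l] only if bit l of mask is set; other lanes keep
   their previous value. */
void masked_add(float dst[LANES], const float a[LANES],
                const float b[LANES], uint16_t mask)
{
    for (int l = 0; l < LANES; ++l) {
        if (mask & (1u << l))
            dst[l] = a[l] + b[l];
    }
}

/* Build a mask from a per-lane comparison (bit l is set if a[l] < b[l]). */
uint16_t compare_lt(const float a[LANES], const float b[LANES])
{
    uint16_t mask = 0;
    for (int l = 0; l < LANES; ++l) {
        if (a[l] < b[l])
            mask |= (uint16_t)(1u << l);
    }
    return mask;
}
```

A branch such as `if (a[i] < b[i]) d[i] = a[i] + b[i];` inside a loop thus becomes one comparison that produces a mask, followed by one masked operation, with no per-element jump.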

MIC Architecture - KNL
Different Chip Topology
- 2D mesh of up to 72 cores
- A tile consists of 2 cores and 1 MB of L2 cache
- 2 VPUs per core, which support the new AVX-512 as well as the older SSE, AVX, and AVX2
New AVX-512 extensions
- Conflict detection: imagine code "for(i=0; i<16; i++) { A[B[i]]++; }"
  - The code can be vectorized only if the values in B are unique.
  - New instructions, such as getting a conflict-free subset, were added.
- Scattered and gathered prefetch
- Exponential and reciprocal instructions
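Why duplicate values in B prevent vectorization of `A[B[i]]++` can be shown with a scalar emulation. This is a sketch, not real AVX-512 code; the functions `increment_scalar`, `increment_naive_vector`, and `has_conflict` are hypothetical names. The naive version mimics a gather, a vector add, and a scatter over 16 lanes.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16

/* Reference semantics: the scalar loop from the slide. */
void increment_scalar(int *A, const int *B)
{
    assert(A != NULL && B != NULL);
    for (int i = 0; i < LANES; ++i)
        A[B[i]]++;
}

/* Emulates a naive vectorization: gather all lanes, add 1, scatter back.
   If B holds duplicates, those lanes all read the same old value and the
   later scatter overwrites the earlier one, so increments are lost. */
void increment_naive_vector(int *A, const int *B)
{
    int tmp[LANES];
    assert(A != NULL && B != NULL);
    for (int l = 0; l < LANES; ++l) tmp[l] = A[B[l]];   /* gather  */
    for (int l = 0; l < LANES; ++l) tmp[l] += 1;        /* add     */
    for (int l = 0; l < LANES; ++l) A[B[l]] = tmp[l];   /* scatter */
}

/* Returns 1 if any index occurs more than once in B, 0 otherwise. */
int has_conflict(const int *B)
{
    for (int i = 0; i < LANES; ++i)
        for (int j = i + 1; j < LANES; ++j)
            if (B[i] == B[j])
                return 1;
    return 0;
}
```

With unique indices both versions agree. If all 16 indices are equal, the scalar loop adds 16 to one element, while the emulated vector version adds only 1. Conflict detection instructions let the hardware find such duplicates so that the code can process a conflict-free subset of lanes at a time.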

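MIC Architecture - KNL
Memory Models
- Transparent cache: the MCDRAM (16 GB) acts as a cache in front of the DDR
- Individual NUMA nodes: the MCDRAM (16 GB) and the DDR are exposed as separate NUMA nodes
- Hybrid mode: a part of the MCDRAM (12 or 8 GB) is exposed as a NUMA node, the rest acts as a cache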

SCIF
Symmetric Communication Interface
- Low-level socket-like interface on top of PCIe
- Supports both message passing and memory mapping
- Peer-to-peer, reliable, in-order communication
SCIF Node: physical endpoint (host, device, …)
- Node ID ~ like an IP address (the host always has ID = 0)
SCIF Port: logical destination on a SCIF node
- SCIF Port ~ like a TCP/UDP port
SCIF Endpoint: represents a connection (like a socket)
- Can be either listening or connected
- Used as a handle for any communication
- An endpoint is represented as a file descriptor on Linux (the FD can be accessed by scif_get_fd(epd)). Thus, it can be used in select() or poll() calls, and it is duplicated on fork().

SCIF
Peer-to-peer Communication Topology
- Multiple endpoints/connections may be created
- Loopback connections are possible

SCIF
SCIF API
- Initialization: scif_open(), scif_bind()
- Listening (server): scif_listen(), scif_accept()
- Connecting (client): scif_connect()
- Communication: scif_send(), scif_recv()
- Termination: scif_close()

SCIF
Memory Transfers
Memory registration
- scif_register(), scif_unregister()
- The registered address space is a separate space that keeps mappings to physical memory (called windows)
  - Windows are identified by offset and length
  - Registration is performed with 4 KB granularity (alignment)
  - Physical memory is identified through the current VM mapping
- Each endpoint has its own registered address space
  - This space is independent of the virtual address space of any process
  - So it can be shared, mapped, or accessed via RMA
See SCIF_UserGuide.pdf for more details.

Registered Memory Window
- The window is mapped in the VA of a process
- The mapping to physical memory remains even after the VA is unmapped
- The window may be stored discontinuously in physical memory (regular paging is applied)

SCIF
Memory Transfers
Explicit RMA transfers
- scif_readfrom(), scif_writeto(), …
- Read/write data from/to a window of a given EP
Memory mapping
- scif_mmap(), scif_munmap()
- Create a mapping of a window to the virtual address space
- The window can be on a remote EP
Synchronization
- scif_fence_mark(): mark previous unfinished RMAs
- scif_fence_wait(): wait for marked RMAs to finish
- scif_fence_signal(): writes 2 given values to 2 given registered windows (one local, one remote) after the RMA transfers complete
Optimization details
- Data are transferred by PCIe transactions (PCIe defines the ordering; the minimum transaction size is 64 B).
- It is better to write (push) data than to read them (read = request for data + write transaction).
- On the writing side, it might be better to use the write-combined (WC) flag (the write might be propagated to a PCIe transaction faster, but the memory is not cached).

Optimizations
Twofold Nature of the Xeon Phi
- The cores are based on an old architecture, but there are a lot of them and they have powerful VPUs
Wrestling the Compiler
- The compiler attempts to vectorize automatically
  - It often needs a little help or explicit code modifications
  - There are a lot of invariants that may hold, but the programmer does not express them in the code
- The programmer may use vector intrinsics or libraries
- Similar techniques are used in a serial environment

Optimizations
Automated Compiler Vectorization
    float *x; float *y;
    for (size_t i = 0; i < N; ++i) {
        y[i] = x[i] * x[i];
    }
- x and y are allocated arrays of length N
- The data should be aligned to the vector register size
- Border cases must be resolved
- The operations must be independent
  - The compiler cannot assume that x and y point to different memory blocks
Loop unrolling
    y[i]   = x[i]   * x[i]
    y[i+1] = x[i+1] * x[i+1]
    y[i+2] = x[i+2] * x[i+2]
    ...
- Subsequent values may be stored in a vector register and computed concurrently
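One standard way to tell the compiler that x and y do not overlap is the C99 `restrict` qualifier. The sketch below wraps the loop from the slide in a hypothetical function `square_all`; whether a particular compiler then vectorizes the loop still depends on its switches and on the target.

```c
#include <assert.h>
#include <stddef.h>

/* restrict is a promise by the programmer that the memory reached through
   x and y does not overlap, so iterations are independent and several of
   them may be computed concurrently in one vector register. */
void square_all(size_t n, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] * x[i];
}
```

Calling the function with overlapping arrays breaks the promise and results in undefined behavior, so the qualifier should be used only where the invariant really holds.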

Optimizations
(Intel) Compiler
- Vectorization is affected by optimization switches
  - -O2, #pragma optimize
- -qopt-report=<level> -qopt-report-phase=vec
  - Get some info about the compiler's vectorization efforts
- Intel-specific optimization pragmas
  - #pragma unroll(n), #pragma nounroll
  - #pragma loop_count
  - #pragma ivdep
  - #pragma simd, #pragma vector
  - #pragma vector nontemporal: stream writes to memory
- Vectorization forced by pragmas may break correctness (handle with care)

Optimizations
Explicit Vectorization
- Inserting instructions in assembler/intrinsics
- Cilk Plus
    __declspec(vector(uniform(b,c),linear(i:1)))
    float foo(float *b, float *c, int i) {
        return b[i] + c[i];
    }
    for (i=0; i<N; ++i)
        a[i] = foo(b, c, i);
- Cilk Plus Array Notation
    for (i=0; i<N; i += veclen)
        c[i:veclen] = a[i:veclen] + b[i:veclen];
- Problem-specific libraries (e.g., MKL)

Optimizations
Other optimizations
Alignment
- __declspec(align(16)) float a[N]; (use 64 to match the 512-bit registers)
- Aligned malloc, aligned STL allocators, …
Avoid manual loop unrolling
- It breaks automated unrolling
Memory prefetching
- #pragma prefetch, #pragma noprefetch
- _mm_prefetch(data, hint)
Large 2 MB memory pages to reduce TLB misses
- mmap() + MAP_HUGETLB instead of malloc()
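For heap data, one portable option for an "aligned malloc" is the C11 function `aligned_alloc`. The sketch below assumes a 64 B alignment (one 512-bit register); `alloc_aligned_floats` and `is_vector_aligned` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VECTOR_ALIGN 64  /* one 512-bit register = 64 bytes */

static_assert((VECTOR_ALIGN & (VECTOR_ALIGN - 1)) == 0,
              "alignment must be a power of two");

/* Allocate n floats aligned to the vector register size.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VECTOR_ALIGN - 1) / VECTOR_ALIGN * VECTOR_ALIGN;
    if (rounded == 0)
        rounded = VECTOR_ALIGN;
    return aligned_alloc(VECTOR_ALIGN, rounded);
}

/* Returns 1 if p is aligned to the vector register size, 0 otherwise. */
int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_ALIGN) == 0;
}
```

Rounding the size up also means that the last vector of the array can be loaded whole, which may simplify the handling of border cases.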

Technical Details
MPSS (Manycore Platform Software Stack)
- The service must run on the host to operate the Xeon Phi
  - It handles card management, bootstrap, network communication, offloads, …
- Contains command-line tools to manage the Xeon Phi
  - micctrl, micinfo, micsmc, micflash, …
Device Operating System
- Customized and pruned Linux
  - /opt/mpss/<version>/sysroots: core
  - /opt/intel/mic/filesystem/micID: customization
  - /var/mpss/micID: customization

Technical Details
Bootstrap
fboot0
- Hardwired (ROM) code that starts first
- Authenticates fboot1 and hands control to it
fboot1
- Code stored in flash memory (can be modified)
- Initializes HW (CPUs, memory, …)
- Downloads the coprocessor OS from the host
- Authenticates the OS
  - If it passes, the OS is started in maintenance mode
  - Otherwise, it boots the OS in "regular" (3rd party) mode
- The OS is booted using the Linux boot protocol
If the fboot0 authentication fails, the card is switched into "zombie mode". One can recover from zombie mode only by switching a jumper on the card and flashing the fboot1 code.

Discussion
