Intel MIC Architecture Internals and Optimizations


Intel MIC Architecture Internals and Optimizations Martin Kruliš by Martin Kruliš (v1.2) 20.10.2017

Intel MIC Architecture
Architecture Revision
- Many simpler (Pentium) cores
- Each equipped with a powerful 512-bit vector engine

MIC Architecture - KNC
Threading Issues
- Xeon Phi has ~60 cores, each handling 4 threads
- Each core has two pipes, each processing one instruction per cycle
  - There are limitations on the instruction types and bundling
  - At least 2 of the logical cores (hardware threads) of each core must be occupied to achieve optimal throughput (depending on the code)
- The threads are scheduled in a round-robin manner
  - If one thread gets stalled, others may fill in
- The core uses an in-order 7-stage pipeline
  - Most int and mask instructions have 1-clock latency

MIC Architecture - KNC
Memory
- 8 memory controllers with two 32-bit channels each, operating at 5.5 GT/s
  - 8 x 2 x 4 B x 5.5 GT/s = 352 GB/s bandwidth
Caches
- 2x 32 kB L1 and 512 kB L2 coherent cache per core
- Common 64 B cache lines
- MESI protocol + distributed tag directory (TD)
  - The TD is used to replace the missing "ownership" state
- Cache coherency may be expensive, since the core ring interface may take a while to deliver a cache line
  - Beware of cache hotspots and false sharing
- Data alignment!
- Access times: L1: 1 cycle, L2: 11 cycles
  - L1 read-after-write latency: 11 cycles
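The false-sharing warning above can be illustrated with a small sketch. It assumes the 64 B cache line stated on the slide and pads each per-thread counter to a full line, so that two threads never write into the same line. The names `padded_counter` and `sum_counters` are hypothetical and only serve this illustration; the threading itself is left out.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* cache line size in bytes, as stated on the slide */

/* One slot per thread. alignas forces each slot onto its own cache line,
   so writes by different threads do not invalidate each other's lines. */
typedef struct {
    alignas(CACHE_LINE) long value;
} padded_counter;

/* Combine the per-thread counters after the parallel phase has finished. */
long sum_counters(const padded_counter *counters, size_t n)
{
    assert(counters != NULL);
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += counters[i].value;
    return total;
}
```

The price is memory: every counter now occupies 64 bytes instead of 8, which is usually acceptable for a few dozen or a few hundred threads.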

MIC Architecture - KNC
Vector Units
- Completely new architecture and instruction set
  - No MMX, SSE, or AVX
- Long 512-bit registers
  - For 16 floats or 8 doubles
- 32 vector registers, 8 mask registers
- Instructions have 4-clock latency, 1-clock throughput
- Rich instruction set
  - Scatter/gather, shuffle, and swizzling
  - Masked (conditional) execution
  - Ternary instructions (two sources, one dest)
- Data alignment!

MIC Architecture - KNL
Different Chip Topology
- 2D mesh of up to 72 cores
- A tile consists of 2 cores and 1 MB of L2 cache
- 2 VPUs per core, which support the new AVX-512 as well as the older SSE, AVX, and AVX2
New AVX-512 extensions
- Conflict Detection: imagine the code "for(i=0; i<16; i++) { A[B[i]]++; }"
  - The code can be vectorized directly only if the values in B are unique
  - New instructions, such as one that gets a conflict-free subset, were added
- Scattered and gathered prefetch
- Exponential and reciprocal instructions
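Why the loop A[B[i]]++ needs unique indices can be shown with a scalar emulation. The sketch below does not use any AVX-512 instructions. It compares the plain loop with a naive "gather, add, scatter" over 16 lanes and adds a simple duplicate check. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16

/* Scalar reference: every occurrence of an index increments its bin. */
void histogram_scalar(int *A, const int *B, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        A[B[i]]++;
}

/* Emulation of a naive vectorization: gather 16 values, add 1 to all of
   them at once, scatter them back. If an index occurs more than once in
   a group of 16, all but one of its increments are lost. */
void histogram_naive_vector(int *A, const int *B, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        int tmp[LANES];
        for (int l = 0; l < LANES; ++l) tmp[l] = A[B[i + l]]; /* gather  */
        for (int l = 0; l < LANES; ++l) tmp[l] += 1;          /* add     */
        for (int l = 0; l < LANES; ++l) A[B[i + l]] = tmp[l]; /* scatter */
    }
}

/* Returns 1 if the 16 indices are pairwise different, 0 otherwise. */
int conflict_free(const int *B)
{
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < i; ++j)
            if (B[i] == B[j])
                return 0;
    return 1;
}
```

With 16 identical indices, the scalar loop adds 16 to one bin while the naive vector version adds only 1. The conflict-detection instructions let vectorized code find such duplicates and process a conflict-free subset of lanes at a time.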

MIC Architecture - KNL
Memory Models
[Figure: MCDRAM (16 GB) and DDR in three configurations; MCDRAM is labeled 12/8 GB in the hybrid configuration]
- Transparent cache
- Individual NUMA nodes
- Hybrid mode

SCIF
Symmetric Communication Interface
- Low-level socket-like interface on top of PCIe
- Supports both message passing and memory mapping
- Peer-to-peer, reliable, in-order communication
SCIF Node: physical endpoint (host, device, …)
- A node ID is like an IP address (the host always has ID = 0)
SCIF Port: logical destination on a SCIF node
- A SCIF port is like a TCP/UDP port
SCIF Endpoint: represents a connection (like a socket)
- Can be either listening or connected
- Used as a handle for any communication
Note: On Linux, an endpoint is represented as a file descriptor (the FD can be obtained by scif_get_fd(epd)). Thus, it can be used in select() or poll() calls, and it is duplicated on fork().

SCIF
Peer-to-peer Communication Topology
- Multiple endpoints/connections may be created
- Loopback

SCIF
SCIF API
- Initialization: scif_open(), scif_bind()
- Listening (server): scif_listen(), scif_accept()
- Connecting (client): scif_connect()
- Communication: scif_send(), scif_recv()
- Termination: scif_close()

SCIF
Memory Transfers
- Memory registration: scif_register(), scif_unregister()
- The registered address space is a separate space that keeps mappings to physical memory (called windows)
  - Windows are identified by offset and length
  - Registration is performed with 4 KB granularity (alignment)
  - Physical memory is identified through the current VM mapping
- Each endpoint has its own registered address space
  - This space is independent of the virtual address space of any process
  - So it can be shared, mapped, or accessed via RMA
Note: See SCIF_UserGuide.pdf for more details.

Registered Memory Window
- The window is mapped in the VA of a process
- The mapping to physical memory remains even after the VA is unmapped
- The physical memory may be stored discontinuously (regular paging is applied)

SCIF
Memory Transfers
- Explicit RMA transfers: scif_readfrom(), scif_writeto(), …
  - Read/write data from/to a window of a given EP
- Memory mapping: scif_mmap(), scif_munmap()
  - Create a mapping of a window into the virtual address space
  - The window can be on a remote EP
- Synchronization
  - scif_fence_mark(): mark previous unfinished RMAs
  - scif_fence_wait(): wait for marked RMAs to finish
  - scif_fence_signal(): write 2 given values to 2 given registered windows (one local, one remote) AFTER the RMA transfers complete
Optimization details:
- Data are transferred by PCIe transactions (PCIe defines the ordering; the minimal transaction size is 64 B).
- It is better to write (push) data than to read them (read = request for data + write transaction).
- On the writing side, it might be better to use the write-combined (WC) flag (the write might be propagated to a PCIe transaction faster, but the memory is not cached).

Optimizations
Twofold Nature of the Xeon Phi
- The cores are based on an old architecture, but there are a lot of them and they have powerful VPUs
Wrestling the Compiler
- The compiler attempts to vectorize automatically
  - It often needs a little help or explicit code modifications
  - There are a lot of invariants that may hold, but the programmer does not express them in the code
- The programmer may use vector intrinsics or libraries
- Similar techniques are used in a serial environment

Optimizations
Automated Compiler Vectorization
    float *x; float *y;   /* allocated arrays of length N */
    for (size_t i = 0; i < N; ++i) {
        y[i] = x[i] * x[i];
    }
- The data should be aligned to the vector register size
- Border cases must be resolved
- The operations must be independent
  - The compiler may not assume that x and y point to different memory blocks
Loop unrolling
    y[i]   = x[i]  *x[i]
    y[i+1] = x[i+1]*x[i+1]
    y[i+2] = x[i+2]*x[i+2]
    ...
- Subsequent values may be stored in a vector register and computed concurrently
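One standard way to express the "x and y point to different memory blocks" invariant is the C99 `restrict` qualifier. The slide does not name a specific mechanism, so this is only a minimal sketch of one option; the function name `square_all` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* restrict is a promise by the programmer that x and y never overlap.
   With that promise the iterations are independent, and the compiler is
   free to load several x[i] into a vector register and process them
   concurrently. Calling the function with overlapping arrays is
   undefined behavior. */
void square_all(size_t n, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] * x[i];
}
```

Whether the loop is actually vectorized still depends on the compiler and its optimization switches, so it is worth checking the vectorization report.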

Optimizations
(Intel) Compiler
- Vectorization is affected by optimization switches: -O2, #pragma optimize
- -qopt-report=<level> -qopt-report-phase=vec
  - Get some info about the compiler vectorization efforts
Intel-specific optimization pragmas
- #pragma unroll(n), #pragma nounroll
- #pragma loop_count
- #pragma ivdep
- #pragma simd, #pragma vector
  - #pragma vector nontemporal: stream writes to memory
- Vectorization may break correctness (handle with care)

Optimizations
Explicit Vectorization
- Inserting instructions in assembler/intrinsics
- Cilk Plus
    __declspec(vector(uniform(b,c),linear(i:1)))
    float foo(float *b, float *c, int i) {
        return b[i] + c[i];
    }
    for (i=0; i<N; ++i)
        a[i] = foo(b, c, i);
- Cilk Plus Array Notation
    for (i=0; i<N; i += veclen)
        c[i:veclen] = a[i:veclen] + b[i:veclen];
- Problem-specific libraries (e.g., MKL)

Optimizations
Other optimizations
- Alignment
  - __declspec(align(64)) float a[N]; (64 B matches the 512-bit vector registers)
  - Aligned malloc, aligned STL allocators, …
- Avoid manual loop unrolling
  - It breaks automated unrolling
- Memory prefetching
  - #pragma prefetch, #pragma noprefetch
  - _mm_prefetch(data, hint)
- Large 2 MB memory pages to reduce TLB misses
  - mmap() + MAP_HUGETLB instead of malloc()
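For heap data, the "aligned malloc" item can be sketched with the standard C11 `aligned_alloc()`. The sketch assumes 64-byte alignment, i.e. the size of one 512-bit vector register and of one cache line. The helper names `alloc_aligned_floats` and `is_vec_aligned` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64 /* bytes; assumed to match a 512-bit vector register */

/* Allocate an array of floats whose first element is 64-byte aligned.
   The byte size is rounded up, because C11 aligned_alloc expects the
   size to be a multiple of the alignment. Release with free(). */
float *alloc_aligned_floats(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(float);
    bytes = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, bytes);
}

/* Returns 1 if the pointer is aligned to VEC_ALIGN bytes, 0 otherwise. */
int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}
```

A check like `is_vec_aligned()` is cheap and useful in debug builds before calling code that relies on aligned vector loads and stores.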

Technical Details
MPSS
- The service must run on the host to operate the Xeon Phi
  - It handles card management, bootstrap, network communication, offloads, …
- Contains command line tools to manage the Xeon Phi
  - micctrl, micinfo, micsmc, micflash, …
Device Operating System
- Customized and pruned Linux
  - /opt/mpss/<version>/sysroots: core
  - /opt/intel/mic/filesystem/micID: customization
  - /var/mpss/micID: customization

Technical Details
Bootstrap
fboot0
- Hardwired (ROM) code that starts first
- Authenticates fboot1 and hands control to it
fboot1
- Code stored in flash memory (can be modified)
- Initializes HW (CPUs, memory, …)
- Downloads the coprocessor OS from the host
- Authenticates the OS
  - If it passes, the OS is started in maintenance mode
  - Otherwise, it boots the OS in "regular" (3rd party) mode
- The OS is booted using the Linux boot protocol
Note: If the fboot0 authentication fails, the card is switched into "zombie mode". One can recover from zombie mode only by switching a jumper on the card and flashing the fboot1 code.

Discussion