
Martin Kruliš, 7. 1. 2016 (v1.1)

 Architecture Revision
◦ Many simpler (Pentium) cores
◦ Each equipped with a powerful 512-bit vector engine

 Threading Issues
◦ Xeon Phi has ~60 cores, each handling 4 hardware threads
◦ Each core has two pipes, each processing one instruction per cycle
 There are limitations on the instruction types and their bundling
 At least 2 of a core's logical processors must be occupied to achieve optimal throughput (depending on the code)
◦ The threads are scheduled in a round-robin manner
 If one thread stalls, the others may fill in
◦ The core uses an in-order 7-stage pipeline
 Most integer and mask instructions have 1-clock latency
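Since a single thread cannot keep both pipes busy on its own, throughput usually improves with several threads per core. A minimal OpenMP sketch; the environment settings in the comment are typical Intel OpenMP usage, not prescribed by the slides:

    #include <omp.h>
    #include <stdio.h>

    /* Typically run with 2-4 threads per core, e.g. (typical settings):
     *   export OMP_NUM_THREADS=120    # 2 threads on each of ~60 cores
     *   export KMP_AFFINITY=compact   # keep sibling threads on one core
     */
    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }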

 Memory
◦ 8 memory controllers, each with two 32-bit channels operating at 5.5 GT/s = 352 GB/s aggregate bandwidth
 Caches
◦ 2x 32 kB L1 (instruction + data) and 512 kB L2 coherent cache per core
 Common 64 B cache lines
◦ MESI protocol + distributed tag directory (TD)
 The TD is used to replace the missing "ownership" state
◦ Cache coherency may be expensive, since the core ring interface may take a while to deliver a cache line
 Beware of cache hotspots and false sharing
Data Alignment!
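The false-sharing warning can be made concrete: when threads update distinct variables that happen to share a 64 B cache line, the line ping-pongs between cores. A minimal sketch of the usual padding fix (all names and the 240-thread sizing are illustrative):

    #include <omp.h>

    #define CACHE_LINE 64

    /* Each per-thread counter is padded and aligned to a full 64B cache
       line, so concurrent updates do not invalidate each other's lines. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    } __attribute__((aligned(CACHE_LINE)));

    static struct padded_counter counters[240];  /* ~60 cores x 4 threads */

    void count_events(long n)
    {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < n; ++i)
                counters[t].value++;  /* no false sharing between threads */
        }
    }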

 Vector Units
◦ Completely new architecture and instruction set
 No MMX, SSE, or AVX
◦ Long 512-bit registers
 Hold 16 floats or 8 doubles
 32 vector registers, 8 mask registers
◦ Instructions have 4-clock latency, 1-clock throughput
◦ Rich instruction set
 Scatter/gather, shuffle, and swizzling
 Masked (conditional) execution
 Ternary instructions (two sources, one destination)
Data Alignment!
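Where automatic vectorization falls short, the vector unit can be driven directly through intrinsics. A minimal sketch, assuming the Intel compiler's 512-bit intrinsics (immintrin.h), 64 B aligned data, and n divisible by 16:

    #include <immintrin.h>

    /* Squares n floats, 16 lanes at a time, in 512-bit vector registers.
       Assumes x and y are 64B-aligned and n is a multiple of 16. */
    void vsquare(float *y, const float *x, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512 v = _mm512_load_ps(x + i);            /* aligned load */
            _mm512_store_ps(y + i, _mm512_mul_ps(v, v)); /* square, store */
        }
    }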

 Symmetric Communication Interface (SCIF)
◦ Low-level socket-like interface on top of PCIe
 Supports both message passing and memory mapping
 Peer-to-peer reliable in-order communication
◦ SCIF Node – physical endpoint (host, device, …)
 Node ID ~ like an IP address (the host always has ID 0)
◦ SCIF Port – logical destination on a SCIF node
 SCIF Port ~ like a TCP/UDP port
◦ SCIF Endpoint – represents a connection (like a socket)
 Can be either listening or connected
 Used as a handle for any communication


 Peer-to-peer Communication Topology
◦ Multiple endpoints/connections may be created
◦ Loopback connections are possible

 SCIF API
◦ Initialization: scif_open(), scif_bind()
◦ Listening (server): scif_listen(), scif_accept()
◦ Connecting (client): scif_connect()
◦ Communication: scif_send(), scif_recv()
◦ Termination: scif_close()
Example below.
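A minimal sketch of both sides of a SCIF connection, assuming the scif.h header shipped with MPSS; the port number is a hypothetical choice and error handling is omitted:

    #include <scif.h>
    #include <stdio.h>

    #define SERVER_PORT 2050  /* hypothetical port chosen for this example */

    /* Server side (e.g., on the host, which is always node 0) */
    void server(void)
    {
        scif_epd_t listener = scif_open();
        scif_bind(listener, SERVER_PORT);
        scif_listen(listener, 1);

        struct scif_portID peer;
        scif_epd_t conn;
        scif_accept(listener, &peer, &conn, SCIF_ACCEPT_SYNC);

        char buf[64];
        scif_recv(conn, buf, sizeof(buf), SCIF_RECV_BLOCK);
        printf("received: %s\n", buf);

        scif_close(conn);
        scif_close(listener);
    }

    /* Client side (e.g., on the coprocessor) */
    void client(void)
    {
        scif_epd_t ep = scif_open();
        struct scif_portID dst = { .node = 0, .port = SERVER_PORT };
        scif_connect(ep, &dst);  /* binds a local port itself if unbound */

        char msg[64] = "hello from the card";
        scif_send(ep, msg, sizeof(msg), SCIF_SEND_BLOCK);
        scif_close(ep);
    }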

 Memory Transfers
◦ Any memory needs to be registered first
 scif_register(), scif_unregister()
 The registered address space is a separate space that keeps mappings to physical memory (called windows)
 Each endpoint has its own registered address space
 Registered windows of the remote endpoint can be accessed
◦ Memory operations
 Explicit RMA transfers: scif_readfrom(), scif_writeto(), …
 Memory mapping: scif_mmap(), scif_munmap()
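A hedged sketch of an explicit RMA write, assuming a connected endpoint and a remote window offset that the peer has already communicated (offsets and sizes are illustrative):

    #include <scif.h>
    #include <stdlib.h>

    #define PAGE_SIZE 0x1000

    void rma_write(scif_epd_t ep, size_t len, off_t remote_off)
    {
        /* Registered memory must be page-aligned. */
        void *buf;
        posix_memalign(&buf, PAGE_SIZE, len);

        /* Expose the buffer as a window in this endpoint's registered
           address space; SCIF returns the offset actually assigned. */
        off_t local_off = scif_register(ep, buf, len, 0,
                                        SCIF_PROT_READ | SCIF_PROT_WRITE, 0);

        /* Blocking RMA write from the local window to the peer's window. */
        scif_writeto(ep, local_off, len, remote_off, SCIF_RMA_SYNC);

        scif_unregister(ep, local_off, len);
        free(buf);
    }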

(Figure) Registered memory windows:
◦ A window is mapped in the VA space of a process
◦ The mapping to physical memory remains even after the VA is unmapped

 Twofold Nature of the Xeon Phi
◦ The cores are based on an old architecture, but there are a lot of them and they have powerful VPUs
 Wrestling the Compiler
◦ The compiler attempts to vectorize automatically
 It often needs a little help, or explicit code modifications
 There are many invariants that may hold, but the programmer does not express them in the code
◦ The programmer may use vector intrinsics or libraries
◦ Similar techniques apply in ordinary (serial) CPU code as well

 Automated Compiler Vectorization

    float *x;   /* allocated arrays of length N */
    float *y;
    for (size_t i = 0; i < N; ++i) {
        y[i] = x[i] * x[i];
    }

◦ Loop unrolling:
    y[i]   = x[i]   * x[i]
    y[i+1] = x[i+1] * x[i+1]
    y[i+2] = x[i+2] * x[i+2]
    ...
◦ Subsequent values may be stored in a vector register and computed concurrently
 The operations must be independent
 Border cases must be resolved
 The data should be aligned to the vector register size
 The compiler may not assume that x and y point to different memory blocks
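The aliasing issue in the last point is commonly resolved with the C99 restrict qualifier; a minimal sketch (the __assume_aligned() hint is specific to the Intel compiler):

    #include <stddef.h>

    /* 'restrict' promises that x and y do not overlap, which removes
       the main obstacle to vectorizing this loop automatically. */
    void square(float *restrict y, const float *restrict x, size_t n)
    {
        __assume_aligned(x, 64);  /* Intel compiler: data is 64B-aligned */
        __assume_aligned(y, 64);
        for (size_t i = 0; i < n; ++i)
            y[i] = x[i] * x[i];
    }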

 (Intel) Compiler
◦ Vectorization is affected by optimization switches
 -O2, #pragma optimize
 -qopt-report= -qopt-report-phase=vec
 Gets some info about the compiler's vectorization efforts
◦ Intel-specific optimization pragmas
 #pragma unroll(n), #pragma nounroll
 #pragma loop_count
 #pragma ivdep
 #pragma simd, #pragma vector
 #pragma vector nontemporal – streaming (non-temporal) writes to memory
 Vectorization may break correctness (handle with care)
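As an illustration of why these pragmas must be handled with care, #pragma ivdep tells the compiler to ignore dependencies it merely assumes but cannot disprove; a hedged sketch:

    /* The compiler cannot prove that idx[] contains no duplicates, so it
       assumes a loop-carried dependency and refuses to vectorize. The
       pragma asserts the programmer's invariant; if the assertion is
       wrong, the vectorized result is incorrect. */
    void gather_add(float *a, const float *b, const int *idx, int n)
    {
        #pragma ivdep
        for (int i = 0; i < n; ++i)
            a[idx[i]] += b[i];
    }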

 Explicit Vectorization
◦ Inserting instructions via assembler/intrinsics
◦ Cilk Plus:

    __declspec(vector(uniform(b,c), linear(i:1)))
    float foo(float *b, float *c, int i)
    {
        return b[i] + c[i];
    }

    for (i = 0; i < N; ++i)
        a[i] = foo(b, c, i);

◦ Cilk Plus array notation:

    for (i = 0; i < N; i += veclen)
        c[i:veclen] = a[i:veclen] + b[i:veclen];

◦ Problem-specific libraries (e.g., MKL)

 Other Optimizations
◦ Alignment
 __declspec(align(16)) float a[N];
 Aligned malloc, aligned STL allocators, …
◦ Avoid manual loop unrolling
 It breaks automated unrolling
◦ Memory prefetching
 #pragma prefetch, #pragma noprefetch
 _mm_prefetch(data, hint)
◦ Large 2 MB memory pages to reduce TLB misses
 mmap() + MAP_HUGETLB instead of malloc()
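A hedged sketch of the two allocation techniques above (Linux-specific; huge pages must be enabled on the system, and the fallback path is illustrative):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    void alloc_demo(size_t bytes)
    {
        /* 64B-aligned allocation (portable; _mm_malloc is an alternative). */
        void *aligned = NULL;
        posix_memalign(&aligned, 64, bytes);

        /* Anonymous mapping backed by 2MB huge pages to reduce TLB misses. */
        void *huge = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (huge == MAP_FAILED) {
            /* Fall back to regular pages if huge pages are unavailable. */
            huge = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }

        /* ... use the buffers ... */
        free(aligned);        /* posix_memalign memory: free()  */
        munmap(huge, bytes);  /* mmap'd memory: munmap()        */
    }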

 MPSS (Intel Manycore Platform Software Stack)
◦ A service that must run on the host to operate the Xeon Phi
◦ It handles card management, bootstrap, network communication, offloads, …
◦ Contains command-line tools to manage the Xeon Phi
 micctrl, micinfo, micsmc, micflash, …
 Device Operating System
◦ Customized and pruned Linux
 /opt/mpss/ /sysroots – core
 /opt/intel/mic/filesystem/micID – customization
 /var/mpss/micID – customization

 Bootstrap
◦ fboot0
 Hardwired (ROM) code that starts first
 Authenticates fboot1 and hands control over to it
◦ fboot1
 Code stored in flash memory (can be modified)
 Initializes the HW (CPUs, memory, …)
 Downloads the coprocessor OS from the host
 Authenticates the OS
 If authentication passes, the OS is started in maintenance mode
 Otherwise it boots the OS in "regular" (3rd-party) mode
 The OS is booted using the Linux boot protocol
