Intel Many Integrated Cores Architecture


Intel Many Integrated Cores Architecture
Martin Kruliš (v1.2), 13. 10. 2016

Intel Many Integrated Cores: History

- Original idea: there is a lot of silicon on a CPU die, so use many x86 cores to build a GPU
- 2006 Project Larrabee
  - The GPU could not keep up with AMD and NVIDIA
- 2007 Teraflops Research Chip (96-bit VLIW architecture)
- 2009 Single-Chip Cloud Computer (48 cores)
- 2010 Knights Ferry (32 cores)
- 2011 Knights Corner (60 cores based on the Pentium 1)
- 2013 Knights Landing (72 cores based on the Atom)
- Knights Hill announced

Intel MIC Architecture: The Xeon Phi Device

- Many simpler (Pentium/Atom-class) cores
- Each core is equipped with a powerful 512-bit vector engine

Intel MIC Architecture: Software Architecture

- A Xeon Phi card is basically an independent Linux machine

Usage Modes

Modes of Xeon Phi usage:
- OpenCL device
- Standalone computational device
  - Connected over TCP/IP (SSH, ...)
  - Or using the low-level Symmetric Communications Interface (SCIF)
- MPI device
  - Communicating over TCP/IP or OFED
  - May be used both ways (MPI ranks running on the host, the device, or both)
- Offload device
  - Explicit offloading
  - Implicit offloading

OFED = OpenFabrics Enterprise Distribution

OpenCL

Xeon Phi as an OpenCL accelerator card:
- Requires the Intel OpenCL platform
- HW-to-OpenCL mapping
  - Xeon Phi card = OpenCL compute device
  - Virtual core = OpenCL compute unit
  - Work items of a work group are processed in a loop, unrolled 16x and vectorized when possible
- The OpenCL runtime manages a thread pool
  - One thread per virtual core
  - Work groups are assigned to the threads as tasks
  - It pays off to run more work groups with fewer work items than on a GPU
- The Intel OpenCL library directory must be the first directory in the LD_LIBRARY_PATH list.
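The example slide referenced in the original deck was not transcribed. The following is a minimal, hypothetical sketch (not from the slides) of how a host program could locate the Xeon Phi as an OpenCL accelerator; it uses only standard OpenCL 1.x host API calls and assumes the Intel OpenCL runtime is installed.

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint platform_count = 0;
    clGetPlatformIDs(8, platforms, &platform_count);

    for (cl_uint p = 0; p < platform_count; ++p) {
        /* The Xeon Phi card is exposed as an accelerator-type device. */
        cl_device_id dev;
        cl_uint dev_count = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ACCELERATOR,
                           1, &dev, &dev_count) != CL_SUCCESS || dev_count == 0)
            continue;

        char name[256];
        cl_uint cus = 0;   /* compute units correspond to the card's virtual cores */
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
        printf("Accelerator: %s (%u compute units)\n", name, cus);
    }
    return 0;
}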

Standalone Device

Using Xeon Phi as a standalone device:
- The device is an autonomous Linux machine
- The code is cross-compiled on the host and deployed to the Xeon Phi (including libraries)
- Complete freedom in the choice of parallelization techniques
  - OpenMP, Intel TBB, Intel Cilk Plus, pthreads, ...
- Communication has to be performed manually, either via
  - the Symmetric Communications Interface (SCIF), or
  - the TCP/IP stack, which uses SCIF as its data-link layer
- Useful for extending master-worker applications
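As an illustration of the native mode (the original example slide was not transcribed), here is a hypothetical OpenMP program that would run entirely on the card. The workflow sketched here assumes the classic Intel compiler: cross-compile on the host (roughly icc -mmic -qopenmp native.c -o native.mic; the exact flags depend on the compiler version), copy the binary and the MIC OpenMP runtime library to the card, and launch it over SSH.

/* native.c -- runs entirely on the Xeon Phi (cross-compiled on the host) */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000L;
    double sum = 0.0;

    /* All hardware threads of the card can be used directly. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; ++i)
        sum += 1.0 / (double)i;   /* partial harmonic sum, just to keep the cores busy */

    printf("threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}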

SCIF Symmetric Communications Interface Socket-like interface that encapsulates PCI-Express data transfers Message passing and RMA transfers Memory mapping techniques Device memory may be mapped to host address space Host memory and memory of other devices can be mapped to address space of a device Upper 512G (32x 16G pages) Supports direct assignment virtualization model All other communication methods are built on SCIF by Martin Kruliš (v1.2) 13. 10. 2016

Offload Model

Offload execution model:
- Host and device code are written together
  - The parts to be offloaded to the device are explicitly marked
  - The compiler performs the dual compilation, inserts the stubs, handles the data transfers, ...
- Explicit offload model (a.k.a. pragma offload)
  - Everything is controlled by the programmer
  - Only binary-safe (bitwise-copyable) data structures can be transferred
- Implicit offload model (a.k.a. shared VM model)
  - Data transfers are handled automatically
  - Complex data structures and pointers may be transferred

Offload Model: Code Compilation

[Diagram: a single source file with regions marked by #pragma offload or _Cilk_offload is compiled twice, once for the host and once for the MIC; the compiler inserts stubs on the host side that dispatch the offloaded regions.]

Explicit Offload Model

Pragma offload:
- Functions and variables are declared with __attribute__((target(mic)))
- Offloaded code is invoked as
    #pragma offload <clauses>
    <offloaded_statement>
- A clause may select the target card: target(mic[:id])
- Other clauses list the data structures used by the offload:
  - in(varlist) - copied to the device before the offload
  - out(varlist) - copied back to the host after the offload
  - inout, nocopy, length, align
- Allocation control: alloc_if(), free_if()

Explicit Offload Model: Example

__attribute__((target(mic))) void preprocess(...) { ... }
...
__attribute__((target(mic))) static float *X;   // of N items

#pragma offload target(mic:0) \
    in(X:length(N) alloc_if(1) free_if(0))
preprocess(X);

#pragma offload target(mic:0) \
    nocopy(X:length(N) alloc_if(0) free_if(0))
process(X);

#pragma offload target(mic:0) \
    out(X:length(N) alloc_if(0) free_if(1))
finalize(X);

Explicit Offload Model: Asynchronous Operations

Execution:
    char sigVar;
    #pragma offload target(mic) signal(&sigVar)
    long_lasting_func();
    ... concurrent CPU work ...

Data transfers:
    #pragma offload_transfer <clauses> signal(&sigVar)

Waiting and polling:
    #pragma offload_wait wait(&sigVar)
    if (_Offload_signaled(micID, &sigVar)) ...

The wait(&sigVar) clause can also be added to offload or offload_transfer pragmas to create dependencies.

Implicit Offload Model

Shared virtual memory model:
- Code and variables that are shared between the host and the device carry the _Cilk_shared attribute (e.g., float _Cilk_shared *X)
- Shared memory must be allocated via specialized functions: _Offload_shared_malloc(), _Offload_shared_free()
- Offloaded code is executed with
    _Cilk_offload <statement>
    _Cilk_offload_to(micId) <statement>
- The data are transferred as the compiler sees fit
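The example slide referenced here was not transcribed; the following hypothetical sketch illustrates the shared-VM style using the Cilk Plus offload keywords described above (the accepted keyword placement may vary between compiler versions).

#include <stdio.h>

_Cilk_shared float data[1024];   /* placed in the shared virtual address range */
_Cilk_shared float result;

/* Functions invoked from offloaded code must be shared as well. */
_Cilk_shared void sum_data(int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += data[i];
    result = s;
}

int main(void) {
    for (int i = 0; i < 1024; ++i)
        data[i] = 1.0f;

    _Cilk_offload sum_data(1024);   /* executed on the coprocessor */

    /* The runtime synchronizes the shared variables around the offload. */
    printf("result = %f\n", result);
    return 0;
}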

Offload Model: A Few More Things

- Compile-time macros for MIC detection: __INTEL_OFFLOAD, __MIC__
- Runtime MIC detection: _Offload_number_of_devices(), _Offload_get_device_number()
- Querying device capabilities is more complicated
  - You can read /proc/cpuinfo on the device, or use the special MicAccessAPI
- Asynchronous data transfers and execution
  - Using signal variables for synchronization
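A small hypothetical snippet (not from the slides) combining the compile-time and runtime checks listed above; it assumes the Intel compiler's offload.h header, which declares the _Offload_* functions.

#include <stdio.h>
#ifdef __INTEL_OFFLOAD
#include <offload.h>
#endif

__attribute__((target(mic)))
void where_am_i(void) {
#ifdef __MIC__
    printf("running on the MIC\n");   /* this branch exists only in the MIC binary */
#else
    printf("running on the host\n");
#endif
}

int main(void) {
#ifdef __INTEL_OFFLOAD
    int devices = _Offload_number_of_devices();
    printf("offload-enabled build, %d device(s) found\n", devices);
    if (devices > 0) {
        #pragma offload target(mic:0)
        where_am_i();
    }
#else
    printf("built without offload support\n");
    where_am_i();
#endif
    return 0;
}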

New Xeon Phi

Xeon Phi (x200 family):
- Available both as a PCIe card and as a "regular" CPU

New Xeon Phi

Xeon Phi (x200 family):
- Up to 72 cores organized in a 2D mesh (8x9 cores)

New Xeon Phi

Xeon Phi (x200 family):
- Cores based on the Atom (Airmont) architecture
  - More versatile than the previous Pentium-based cores
  - Also support SSE, AVX, AVX2, ...
  - Two AVX-512 units per core
- Memory
  - 16 GB of on-package MCDRAM (used either as a transparent cache or as a separate NUMA node)
  - Up to 384 GB of DDR4
- Connectivity
  - Intel Omni-Path Fabric (via 2x16 PCIe links)
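When the MCDRAM is configured in flat mode (i.e., exposed as a separate NUMA node), an application can either be bound to it externally (for example with numactl --membind on the MCDRAM node) or allocate hot data from it explicitly. The sketch below is hypothetical and assumes the memkind library's hbwmalloc interface, a common way to request high-bandwidth memory; it is not taken from the slides.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory interface */

int main(void) {
    const size_t n = 1 << 20;
    int in_hbw = 0;
    double *buf = NULL;

    /* Try to place the hot buffer in MCDRAM; fall back to DDR4 if unavailable. */
    if (hbw_check_available() == 0) {
        buf = (double *)hbw_malloc(n * sizeof(double));
        in_hbw = (buf != NULL);
    }
    if (!buf)
        buf = (double *)malloc(n * sizeof(double));

    for (size_t i = 0; i < n; ++i)
        buf[i] = (double)i;
    printf("buf[42] = %f (in %s)\n", buf[42], in_hbw ? "MCDRAM" : "DDR4");

    /* Free with the allocator that was actually used. */
    if (in_hbw)
        hbw_free(buf);
    else
        free(buf);
    return 0;
}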

Discussion