Overview for Intel Xeon Processors and Intel Xeon Phi Coprocessors
Original author: James Reinders, Intel. Presented by Aditya Ambardekar.

References:
- Intel® Xeon Phi™ Coprocessor (codename Knights Corner).
- Intel® Many Integrated Core Architecture: An Overview and Programming Models (ORNL Electronic Structure Workshop).
- Intel details Knights Corner architecture at long last.
- Xeon Phi Update.
- Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 1: Optimization Essentials.
- Adventures with HPC Accelerators: GPUs and Intel MIC Coprocessors (results presented at the TeraGrid 2011 conference).
- NCCS introduction to MIC.
- NCSA Scientist Backs MICs over GPUs.

Paper flow
- Introduction (performance capability; the parallelism "dual-transforming-tuning" advantage).
- Features of Knights Corner, with architecture diagram (MIC).
- Tuning your applications for parallel performance (the author's favorite point).
- Performance and cache optimizations.
- Compiler and programming models (MPI vs. offload).
- Xeon Phi vs. GPU (the author does not go into detail, so we cover it as an additional topic).
- Summary.

Introduction to Many Integrated Core (MIC)
The basis of the Intel MIC architecture is to leverage the x86 legacy by creating an x86-compatible multiprocessor architecture that can utilize existing parallelization software tools. Programming tools include OpenMP, OpenCL, Intel Cilk Plus, and specialized versions of Intel's Fortran, C++, and math libraries. More than 50 cores, multiple threads per core.

Multicore vs. Many-core
(Fig. 01: Many-core vs. Multicore.)

Introduction to Xeon Phi
- Intel Xeon Phi Coprocessor is the brand name for Many Integrated Core architecture products.
- The Xeon Phi is essentially a parallel x86 supercomputer on a chip.
- Applications that fully utilize the scaling capabilities of Intel Xeon processor-based systems gain additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.

First Intel Xeon Phi Coprocessor: Knights Corner
- x86-based SMP on a chip with over 50 cores, multiple threads per core (e.g., 4 threads on the SE10X), and 512-bit SIMD instructions.
- Traces its history to the Pentium design, but adds 64-bit support and 4 hardware threads per core.
- Requires a host processor. Runs Linux.
- Cores clocked at 1 GHz or more.
- Each core has a local 512-KB L2 cache with high-speed access to all the other L2 caches.
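To make the 512-bit SIMD point concrete, here is a minimal C sketch (illustrative, not from the original slides) of the kind of loop the compiler can map onto the coprocessor's vector unit, 16 single-precision floats per instruction:

    /* A unit-stride, dependence-free loop: the ideal candidate for
     * auto-vectorization onto the 512-bit SIMD unit. */
    #include <stddef.h>

    void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* contiguous, vectorizable */
    }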

Typical Platform
(Fig. 02: System Platform.)
Connection is via the PCIe bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack can be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it. Multiple Intel Xeon Phi coprocessors can be installed in a single host system. Within a single system, the coprocessors can communicate with each other through the PCIe peer-to-peer interconnect without any intervention from the host. Similarly, the coprocessors can also communicate through a network card without any intervention from the host.
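As a hedged illustration of that network-node behavior (assuming the stock software stack, where the first coprocessor typically appears under the hostname mic0):

    ssh mic0            # log in to the coprocessor's embedded Linux
    cat /proc/cpuinfo   # inspect its cores and threads as on any Linux node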

Knights Corner architecture
(Fig. 03: Knights Corner Architecture. Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.)

Ring Architecture
(Fig. 04: Ring Architecture.)

When to use Intel Xeon Phi?
- When the application already maximizes the capabilities of the Intel Xeon processor.
- High performance comes from parallel software combined with parallel hardware.
(Fig. 05: Software and Hardware Parallelism. Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.)

When to use Intel Xeon Phi?
- Applications that scale well past 100 threads qualify as highly parallel.
- More parallelism = better performance.
- Two fundamental considerations for an application: A) scaling, B) vectorization.

Tuning your applications for Xeon Phi
- Check scaling: create a simple graph of performance as you run with various numbers of threads on an Intel Xeon processor-based system.
- Check vectorization: compile your application with and without vectorization and compare the performance (a sketch follows this list). The most effective use of Intel Xeon Phi coprocessors comes when most of the executed cycles are vector instructions.
- Check memory bandwidth: for this to be efficient, the application needs to exhibit good locality of reference and utilize caches well in its core computations.
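A sketch of that with/without-vectorization comparison, assuming the Intel Composer XE 2013-era compiler flags; the file names are illustrative:

    icc -O3 -vec-report2 app.c -o app_vec     # vectorized build; reports which loops vectorized
    icc -O3 -no-vec app.c -o app_novec        # same optimization level, vectorization disabled
    # Time both binaries; a large gap suggests the code will benefit from
    # the coprocessor's wider vector units.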

Tuning your applications for Xeon Phi
- Consider the communication-to-computation ratio when using MPI; it helps decide between the native and offload models.
- Use a strategy of overlapping communication and I/O with computation.

Tools to measure performance
- Intel VTune Amplifier XE 2013: to measure computations, e.g., L1 compute density.
- Intel Trace Analyzer and Collector: for profiling MPI communications.
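A hedged example of driving both tools from the command line (collector and option names as shipped with the 2013-era tools; application names are hypothetical):

    amplxe-cl -collect knc-hotspots -- ./app      # VTune Amplifier XE, Knights Corner hotspots
    mpirun -trace -np 16 ./mpi_app                # Intel MPI run traced for the Trace Analyzer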

Performance Optimizations
- Memory access and loop transformations (e.g., cache blocking, loop unrolling, prefetching, tiling).
- Blocking or tiling: code runs faster when data are reused while they are still in the processor registers or the processor cache. It is frequently possible to block or tile operations so that data are reused before they are evicted from cache.
- Data structure transformations: code runs best when data are accessed in sequential address order from memory. Frequently, developers change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA), sketched after this list.
- Algorithm selection: favor algorithms that are parallelization- and vectorization-friendly.
- Large page considerations: use the Linux libhugetlbfs library, which provides easy access to huge memory pages; preload the library to back text, data, malloc(), or shared memory with huge pages.
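A minimal C sketch of the AoS-to-SoA transformation named above; the particle fields and array size are illustrative:

    /* Array of structures: the x values of successive elements are 12 bytes
     * apart, so a loop over all x values is strided and vectorizes poorly. */
    struct ParticleAoS { float x, y, z; };
    struct ParticleAoS particles_aos[1024];

    /* Structure of arrays: all x values are contiguous, giving the
     * unit-stride accesses the vector units and prefetchers want. */
    struct ParticlesSoA {
        float x[1024];
        float y[1024];
        float z[1024];
    };
    struct ParticlesSoA particles_soa;

    void scale_x(float s)
    {
        for (int i = 0; i < 1024; i++)
            particles_soa.x[i] *= s;   /* contiguous loads and stores */
    }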

Cache Optimizations
- Make maximum effective use of caches.
- Maximize locality of references, organized first around the threads being used per core and then around all the threads across the coprocessor.
- Ensure prefetching is utilized efficiently; organizing data streams to be accessed sequentially helps here. (A tiling sketch follows this list.)
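As one concrete instance of blocking for cache locality, a hedged C sketch of a tiled matrix transpose; N and the tile size B are illustrative tuning parameters:

    #define N 4096
    #define B 64   /* chosen so a B x B block of each array stays cache-resident */

    void transpose_tiled(float dst[N][N], const float src[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                /* work on one cache-resident tile at a time, reusing it
                 * fully before it is evicted */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j][i] = src[i][j];
    }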

Offload vs. Native model
Coprocessor usage in an application:
1) Processor-centric "offload" model: the program is viewed as running on processors and offloading select work to coprocessors.
2) Native model: the program runs natively on both processors and coprocessors, which may communicate with each other.

Offload Model
- Fortran/C++ pragma support. A future version of OpenMP is expected to include offload directives.
- Offload the complex program components that the Xeon Phi can process.
- No need to worry about the placement of ranks on coprocessor cores or about load-balancing work across all the available cores.

Offload model example
You can control the offloading in your own code by issuing a special compiler pragma before the loop you want to offload:

    #pragma offload target (mic)

Here is an example code snippet where the following OpenMP loop is offloaded to the Phi:

    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= count;  /* executes on host */

When building the code with the Intel compiler, a single additional flag is all that's necessary:

    icc -offload-build -o hello hello.c
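The same pragma family (Intel's Language Extensions for Offload) also takes explicit data-movement clauses; a hedged sketch, where the array names and the length n are illustrative:

    /* copy a[0..n) to the coprocessor, run the loop there, and copy
     * b[0..n) back to the host when the offload region finishes */
    #pragma offload target(mic) in(a : length(n)) out(b : length(n))
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];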

Native Model
- An MPI program may run natively, with ranks on both coprocessors and processors.
- One needs to fit the problem in the coprocessor environment: memory on the coprocessor is limited.
- Load balancing between Xeon processor cores and Xeon Phi coprocessor cores is required.

Native model example: running MPI codes
Since the Intel Xeon Phi has an OS and is fully network accessible, existing MPI code can run on the Phi alongside existing compute nodes. Assume you have two compute nodes (node01, node02), each with a Xeon Phi installed (node01-phi, node02-phi). First we build the Intel MPI code.

For the host Xeon processor:

    mpiicc -o mpi-hello.x86_64 mpi-hello.c

For the Xeon Phi processor (add the additional '-mmic' argument):

    mpiicc -mmic -o mpi-hello.mic mpi-hello.c

Run it as follows (the ':' separates the per-host executable groups, with an example rank count for each):

    mpirun -np 4 -host node01 mpi-hello.x86_64 \
         : -np 4 -host node01-phi mpi-hello.mic \
         : -np 4 -host node02 mpi-hello.x86_64 \
         : -np 4 -host node02-phi mpi-hello.mic

Recommended compilers/programming models
- Fortran programmers: use OpenMP, DO CONCURRENT, and MPI.
- C++ programmers: Intel TBB, Intel Cilk Plus, and OpenMP.
- C programmers: OpenMP and Intel Cilk Plus. (A Cilk Plus sketch follows this list.)
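A minimal Cilk Plus sketch of one of the recommended C/C++ models; the function and array names are illustrative:

    #include <cilk/cilk.h>

    /* cilk_for lets the runtime spread iterations across the available
     * cores and hardware threads */
    void vadd(float *c, const float *a, const float *b, int n)
    {
        cilk_for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }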

Xeon Phi vs. GPU
Table 01: Xeon Phi vs. GPU

    Feature        Xeon Phi / MIC                GPU
    Architecture   x86 based                     Streaming processors
    Caches         Coherent caches               Shared memory and caches
    MPI            MPI on host & MIC             MPI on host only
    Programming    Native / offload              Kernels
    Languages      C, C++, Fortran, OpenMP, ...  CUDA, OpenCL, ...

Summary
- Fundamentals: maximize parallel computations and minimize data movement.
- Parallel computations: scaling and vectorization.
- "Transforming and tuning" applications for scaling, vector usage, and memory usage applies to both Xeon processors and Xeon Phi coprocessors.
- MIC offers x86-based architecture legacy support, a feature advantage over the GPU's newer streaming-based architecture.

Questions