ORIGINAL AUTHOR JAMES REINDERS, INTEL PRESENTED BY ADITYA AMBARDEKAR Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors.


1 ORIGINAL AUTHOR JAMES REINDERS, INTEL PRESENTED BY ADITYA AMBARDEKAR Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors

2 References:
- Intel® Xeon Phi™ Coprocessor (codename Knights Corner): http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
- Intel® Many Integrated Core Architecture: An Overview and Programming Models: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/ORNL_Elec_Struct_WS_02062012.pdf
- Intel details Knights Corner architecture at long last: http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/#.USLaX6U4uuJ
- Xeon Phi Update: http://www.advancedclustering.com/news/xeon-phi-update.html
- Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors - Part 1: Optimization Essentials: http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
- Results at TeraGrid 2011 conference: http://www.hpcwire.com/hpcwire/2011-08-15/adventures_with_hpc_accelerators:_gpus_and_intel_mic_coprocessors.html
- NCCS introduction to MIC: http://www.nccs.nasa.gov/images/Intro-to-MIC012913.pdf
- NCSA Scientist Backs MICs over GPUs: http://goparallel.sourceforge.net/ncsa-scientist-backs-mics-gpus/

3 Paper flow
- Introduction (performance capability, parallelism, dual transforming-tuning advantage).
- Features of Knights Corner with architecture diagram (MIC).
- Tuning your applications for parallel performance (the author's favorite point).
- Performance and cache optimizations.
- Compiler and programming models (MPI vs. offload).
- Xeon Phi vs. GPU (the author doesn't go into details, so we cover it as an additional topic).
- Summary.

4 Introduction to Many Integrated Core (MIC)
- The basis of the Intel MIC architecture is to leverage x86 legacy by creating an x86-compatible multiprocessor architecture that can use existing parallelization software tools.
- Programming tools include OpenMP, OpenCL, Intel Cilk Plus, and specialized versions of Intel's Fortran, C++, and math libraries.
- More than 50 cores, multiple threads per core.

5 Multicore vs. Many core Fig 01 : Many Core vs. Multicore. Ref http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.

6 Introduction to Xeon Phi
- Intel Xeon Phi coprocessors: the brand name for Many Integrated Core architecture products.
- Xeon Phi is essentially a parallel x86 supercomputer on a chip.
- Applications that fully utilize the scaling capabilities of Intel Xeon processor based systems now gain additional power-efficient scaling, vector support, and local memory bandwidth, while keeping the programmability and support associated with Intel Xeon processors.

7 First Intel Xeon Phi Coprocessor: Knights Corner
- x86-based SMP on a chip with over 50 cores, multiple threads per core (e.g. 4 threads on the SE10X), and 512-bit SIMD instructions.
- Cores trace their history to the Pentium design, with 64-bit support and 4 hardware threads per core added.
- Requires a host processor.
- Runs Linux.
- Cores clocked at 1 GHz or more.
- Each core has a local 512 KB L2 cache with high-speed access to all other L2 caches.

8 Typical Platform
Fig 02: System Platform. Ref: http://software.intel.com/en-us/articles/Intel-Xeon-phi-coprocessor-codename-knights-corner.
- Connection via PCIe bus.
- Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack could be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it.
- Multiple Intel Xeon Phi coprocessors can be installed in a single host system. Within a single system, the coprocessors can communicate with each other through the PCIe peer-to-peer interconnect without any intervention from the host.
- Similarly, the coprocessors can also communicate through a network card without any intervention from the host.

9 Knights Corner architecture
Fig 03. Knights Corner Architecture. Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.

10 Ring Architecture Fig 04. Ring Architecture Ref: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

11 When to use Intel Xeon Phi?
- When the application already maximizes the capabilities of the Intel Xeon processor.
- High performance comes from parallel software combined with parallel hardware.
Fig 05. Software and Hardware Parallelism. Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.

12 When to use Intel Xeon Phi?
- Applications that scale well past 100 threads qualify as highly parallel.
- More parallelism = better performance.
- Two fundamental considerations for an application:
  A) Scaling.
  B) Vectorization.
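To see why scaling "well past 100 threads" is a demanding requirement, the sketch below uses Amdahl's law, which is an assumption brought in here and not something the slides cite. The function names are hypothetical. It estimates the speedup on n threads when a given fraction of the run time is serial.

```c
#include <assert.h>

/* Amdahl's law (an illustrative model, not taken from the slides):
 * speedup on n threads when a fraction `serial` (0.0 to 1.0) of the
 * single-thread run time cannot be parallelized. */
double amdahl_speedup(double serial, int n)
{
    assert(serial >= 0.0 && serial <= 1.0 && n > 0);
    return 1.0 / (serial + (1.0 - serial) / n);
}

/* Parallel efficiency: achieved speedup divided by the thread count. */
double amdahl_efficiency(double serial, int n)
{
    return amdahl_speedup(serial, n) / n;
}
```

Under this model, a program with only 1% serial work reaches a speedup of about 50 on 100 threads, which is roughly 50% efficiency. Applications that benefit from more than 100 threads therefore need a serial fraction well below 1%.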

13 Tuning your applications for Xeon Phi
- Check scaling: create a simple graph of performance as you run with various numbers of threads on an Intel Xeon processor based system.
- Check vectorization: compile your application with and without vectorization and compare the performance. Intel Xeon Phi coprocessors are used most effectively when most of the executed cycles are in vector instructions.
- Check memory bandwidth: for memory use to be efficient, the application needs to exhibit good locality of reference and use the caches well in its core computations.
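As an illustration of the vectorization check, here is a minimal sketch of a loop written so that a compiler can vectorize it: unit-stride access, no dependence between iterations, and `restrict` to promise that the arrays do not overlap. The function `saxpy` is a hypothetical example, not code from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` tells the compiler that x and y never alias, and the
 * unit-stride, dependence-free body is a typical vectorization candidate.
 * With 512-bit SIMD, one vector instruction covers 16 single-precision
 * floats (512 / 32 = 16). */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Building a kernel like this once with vectorization enabled and once with it disabled, then timing both, gives the comparison the slide describes.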

14 Tuning your applications for Xeon Phi
- Communication vs. computation ratio when using MPI: helps in deciding between the native and offload models.
- Strategy of overlapping communication and I/O with computation.

15 Tools to measure performance
- Intel VTune Amplifier XE 2013: measures computation, e.g. L1 compute density.
- Intel Trace Analyzer and Collector: profiles MPI communication.

16 Performance Optimizations
- Memory access and loop transformations (e.g. cache blocking, loop unrolling, prefetching, tiling…).
- Blocking or tiling: code runs faster when data are reused while they are still in the processor registers or the processor cache. It is frequently possible to block or tile operations so that data are reused before they are evicted from cache.
- Data structure transformations: code runs best when data are accessed in sequential address order from memory. Developers frequently change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA).
- Algorithm selection: favor algorithms that are parallelization- and vectorization-friendly.
- Large page considerations: use the Linux libhugetlbfs library, which provides easy access to huge memory pages. Preload the library to back text, data, malloc() or shared memory with huge pages.
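The AoS to SoA transformation mentioned above can be sketched as follows. The type names, the field layout and the fixed capacity `N` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* hypothetical fixed capacity for this sketch */

/* Array of structures: the x, y, z of one point sit next to each other,
 * so a loop that reads only x strides over memory three floats at a time. */
struct point_aos { float x, y, z; };

/* Structure of arrays: all x values are contiguous, then all y, then all z,
 * so a loop over x reads memory in sequential address order. */
struct points_soa { float x[N], y[N], z[N]; };

/* Copy n points from the AoS layout into the SoA layout. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, size_t n)
{
    assert(in != NULL && out != NULL && n <= N);
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Unit-stride reduction over one field of the SoA layout. */
float sum_x(const struct points_soa *p, size_t n)
{
    float s = 0.0f;
    assert(p != NULL && n <= N);
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

The conversion cost is paid once. Every later pass over a single field then gets the linear access pattern that caches, prefetchers and vector units favor.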

17 Cache Optimizations
- Make maximum effective use of the caches.
- Maximize locality of reference, organized first around the threads used per core and then around all the threads across the coprocessor.
- Ensure prefetching is used efficiently.
- Organizing data streams is best c
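A minimal sketch of cache blocking (tiling), using matrix multiplication as the example. The tile size `TILE` and the function names are assumptions made for illustration. A real tile size would be tuned so that the working set of three tiles fits in the per-core cache.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* hypothetical tile edge; tune for the target cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed tile by tile so that
 * each tile of A, B and C is reused while it is still resident in cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* Work inside one TILE x TILE block. */
                for (size_t i = ii; i < min_sz(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, n); k++) {
                        double a = A[i * n + k];
                        /* Innermost loop is unit-stride over B and C. */
                        for (size_t j = jj; j < min_sz(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The result is the same as the plain triple loop. Only the order in which elements are visited changes, so that data are reused before they are evicted.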

18 Offload vs. Native model
Coprocessor usage in an application:
1) Processor-centric "offload" model: the program is viewed as running on processors and offloading select work to coprocessors.
2) Native model: the program runs natively on both processors and coprocessors, which may communicate with each other.

19 Offload Model
- Fortran/C++ pragma support.
- A future version of OpenMP is to include offload directives.
- Offload complex program components that the Xeon Phi can process.
- No need to worry about placing ranks on coprocessor cores in order to load-balance work across all available cores.

20 Offload model example
You can control offloading in your own code by placing a special compiler pragma before the loop you want to offload:
    #pragma offload target(mic)
Here is an example code snippet in which the following OpenMP loop is offloaded to the Phi:
    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= count;  /* executes on host */
When building code with the Intel compiler, a single additional flag is all that's necessary:
    icc -offload-build -o hello hello.c
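For reference, the offloaded loop can be wrapped into a complete function as in the sketch below. The function name `compute_pi` and the use of `double` in place of the slide's `float` are choices made here, not part of the original example. Compilers without Intel offload support ignore the offload pragma, possibly with a warning, and run the loop on the host.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi as the integral of 4 / (1 + t^2) over [0, 1],
 * using `count` equal-width intervals. */
double compute_pi(long count)
{
    double pi = 0.0;
    long i;

    assert(count > 0);

    /* Intel-specific offload pragma from the slide: the loop below runs on
     * the coprocessor when the compiler supports offload, else on the host. */
    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        double t = (i + 0.5) / count;  /* midpoint of interval i */
        pi += 4.0 / (1.0 + t * t);
    }

    return pi / count;  /* executes on the host */
}
```

The `reduction(+:pi)` clause gives each thread a private partial sum and combines them at the end, so the loop has no data race on `pi`.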

21 Native Model
- An MPI program may run natively, with ranks on both coprocessors and processors.
- The problem has to fit in the coprocessor environment, which has limited memory.
- Load balancing is needed between Xeon processor cores and Xeon Phi coprocessor cores.
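One simple way to approach the load-balancing point is a static split in proportion to measured throughput. The sketch below is an assumption made for illustration, not a method from the presentation. The function name `host_share` and the idea of passing in measured rates are hypothetical.

```c
#include <assert.h>

/* Split `total` work items between host and coprocessor in proportion to
 * their measured throughput (items per second). Returns the number of
 * items for the host; the coprocessor takes the remainder. */
long host_share(long total, double host_rate, double mic_rate)
{
    assert(total >= 0 && host_rate > 0.0 && mic_rate > 0.0);
    double fraction = host_rate / (host_rate + mic_rate);
    long share = (long)(total * fraction + 0.5);  /* round to nearest */
    return share > total ? total : share;
}
```

For example, if the coprocessor ranks process items three times as fast as the host ranks, the host receives a quarter of the work. The coprocessor's portion must still fit in its limited memory, which may cap its share in practice.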

22 Native model example: running MPI codes
Since the Intel Xeon Phi has an OS and is fully network accessible, existing MPI code can be run on the Phi alongside existing compute nodes. Assume you have 2 compute nodes (node01, node02), each with a Xeon Phi installed (node01-phi, node02-phi).
First, build the code with Intel MPI.
For the host Xeon processor:
    mpiicc -o mpi-hello.x86_64 mpi-hello.c
For the Xeon Phi coprocessor (add the '-mmic' argument):
    mpiicc -mmic -o mpi-hello.mic mpi-hello.c
Run it as follows:
    mpirun -np 4 -host node01 mpi-hello.x86_64 \
        -host node01-phi mpi-hello.mic \
        -host node02 mpi-hello.x86_64 \
        -host node02-phi mpi-hello.mic

23 Recommended Compilers/Programming models
- Fortran programmers: OpenMP, DO CONCURRENT and MPI.
- C++ programmers: Intel TBB, Intel Cilk Plus and OpenMP.
- C programmers: OpenMP and Intel Cilk Plus.

24 Xeon Phi vs. GPU
Table 01: Xeon Phi vs. GPU
Features     | Xeon Phi / MIC              | GPU
Architecture | x86 based                   | Streaming processors
Caches       | Coherent caches             | Shared memory and caches
MPI          | MPI on host & MIC           | MPI on host only
Programming  | Native/offload              | Kernels
Languages    | C, C++, Fortran and OpenMP… | CUDA, OpenCL…

25 Summary
- Fundamentals: maximize parallel computation and minimize data movement.
- Parallel computation: scaling and vectorization.
- "Transforming and tuning" applications for scaling, vector usage and memory usage benefits both Xeon processors and Xeon Phi coprocessors.
- MIC offers an x86-based architecture with legacy support, in contrast to the GPU's newer streaming-based architecture.

26 Questions

