
High Performance Computing with AMD Opteron Maurizio Davini

Agenda
 OS
 Compilers
 Libraries
 Some benchmark results
 Conclusions

64-Bit Operating Systems: Recommendations and Status
 SUSE SLES 9 with the latest Service Pack available
   Has technology for supporting the latest AMD processor features
   Widest breadth of NUMA support, enabled by default
   Oprofile system profiler installable as an RPM and modularized
   Complete support for statically and dynamically linked 32-bit binaries
 Red Hat Enterprise Server 3.0, Service Pack 2 or later
   NUMA feature support not as complete as that of SUSE SLES 9
   Oprofile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
   Only SP2 or later has complete 32-bit shared object library support (a requirement for running all 32-bit binaries on a 64-bit OS)
   POSIX threading library changed between 2.1 and 3.0, which may require users to rebuild applications

AMD Opteron Compilers: PGI, Pathscale, GNU, Absoft, Intel, Microsoft, and SUN

Compiler Comparisons Table
Critical features supported by x86 compilers (PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft): vector SIMD support, peels vector loops, global IPA, OpenMP, links ACML libraries, profile-guided feedback, aligns vector loops, parallel debuggers, large array support, medium memory model.

Tuning Performance with Compilers: Maintaining Stability while Optimizing
 STEP 0: Build the application using the following procedure (sketched below):
   Compile all files with the most aggressive optimization flags: -tp k8-64 -fastsse
   If compilation fails or the application doesn't run properly, turn off vectorization: -tp k8-64 -fast -Mscalarsse
   If problems persist, compile at a low optimization level: -tp k8-64 -O0 (or -O1)
 STEP 1: Profile the binary and determine the performance-critical routines
 STEP 2: Repeat STEP 0 on the performance-critical functions, one at a time, and run the binary after each step to check stability
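A minimal sketch of this procedure as it might look in practice. The pgcc driver invocation, file names, and the toy workload are illustrative assumptions, not part of the original slides:

/*
 * Flag-mining sketch for the PGI compiler (illustrative):
 *
 *   pgcc -tp k8-64 -fastsse          app.c -o app   # most aggressive
 *   pgcc -tp k8-64 -fast -Mscalarsse app.c -o app   # if vectorization misbehaves
 *   pgcc -tp k8-64 -O0               app.c -o app   # last resort
 *
 * Run the binary after each build and compare its output against a
 * known-good reference before moving to the next routine.
 */
#include <stdio.h>

int main(void) {
    /* Stand-in for the real workload: a loop the compiler may vectorize. */
    double sum = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        sum += (double)i * 0.5;
    printf("checksum = %f\n", sum);
    return 0;
}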

PGI Compiler Flags: Optimization Flags
Below are three sets of recommended PGI compiler flags for flag-mining application source bases:
 Most aggressive: -tp k8-64 -fastsse -Mipa=fast
   Enables instruction-level tuning for Opteron, O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations, and unrolling
   Strongly recommended for any single-precision source code
 Middle ground: -tp k8-64 -fast -Mscalarsse
   Enables all of the most aggressive optimizations except vector code generation, which can reorder loops and generate slightly different results
   A good substitute for double-precision source bases, since Opteron has the same throughput on scalar and vector double-precision code
 Least aggressive: -tp k8-64 -O0 (or -O1)

PGI Compiler Flags: Functionality Flags
 -mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2 GB (see the example below)
 -Mlarge_arrays: use if any array in your application is greater than 2 GB
 -KPIC: use when linking to shared object (dynamically linked) libraries
 -mp: process OpenMP/SGI directives/pragmas (build multi-threaded code)
 -Mconcur: attempt auto-parallelization of your code on an SMP system
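A minimal sketch of the case -mcmodel=medium addresses: a single static allocation beyond the 2 GB limit of the default small code model. The build line and array size are illustrative assumptions:

/*
 * Hypothetical build line:  pgcc -tp k8-64 -mcmodel=medium big.c -o big
 * Without -mcmodel=medium, static data is limited to 2 GB and the link
 * step typically fails with relocation overflow errors.
 */
#include <stdio.h>

#define N (400L * 1000L * 1000L)   /* 400M doubles ~= 3.2 GB of static data */

static double big_array[N];

int main(void) {
    big_array[N - 1] = 1.0;
    printf("last element: %f\n", big_array[N - 1]);
    return 0;
}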

Absoft Compiler Flags: Optimization Flags
Below are three sets of recommended Absoft compiler flags for flag-mining application source bases:
 Most aggressive: -O3
   Loop transformations, instruction preference tuning, cache tiling, and SIMD code generation (CG); generally provides the best performance but may cause compilation failure or slow performance in some cases
   Strongly recommended for any single-precision source code
 Middle ground: -O2
   Enables most of the options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, pipelining, and unrolling
   A good substitute for double-precision source bases, since Opteron has the same throughput on scalar and vector double-precision code
 Least aggressive: -O1

Absoft Compiler Flags: Functionality Flags
 -mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2 GB
 -g77: enables full compatibility with g77-produced objects and libraries (must be used to link to the GNU ACML libraries)
 -fpic: use when linking to shared object (dynamically linked) libraries
 -safefp: performs certain floating-point operations more slowly to avoid overflow and underflow and to assure proper handling of NaNs

Pathscale Compiler Flags: Optimization Flags
 Most aggressive: -Ofast
   Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
 Aggressive: -O3
   Optimizations for the highest-quality code, enabled at the cost of compile time
   Some of the generally beneficial optimizations included may hurt performance in particular cases
 Reasonable: -O2
   Extensive conservative optimizations
   Optimizations almost always beneficial
   Faster compile time
   Avoids changes that affect floating-point accuracy

Pathscale Compiler Flags: Functionality Flags
 -mcmodel=medium: use if static data structures are greater than 2 GB
 -ffortran-bounds-check: (Fortran) check array bounds
 -shared: generate position-independent code for calling shared object libraries
 Feedback-directed optimization (sketched below):
   STEP 0: Compile the binary with -fb_create fbdata
   STEP 1: Run the code to collect data
   STEP 2: Recompile the binary with -fb_opt fbdata
 -march=(opteron|athlon64|athlon64fx): optimize code for the selected platform (Opteron is the default)
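A minimal sketch of the feedback-directed optimization cycle. The pathcc driver invocation and the toy program are illustrative assumptions; the branchy loop stands in for the data-dependent control flow that feedback data helps the compiler lay out:

/*
 * Hypothetical FDO cycle with the PathScale driver:
 *
 *   pathcc -O3 -fb_create fbdata fdo.c -o app   # instrumented build
 *   ./app 10000000                              # training run, writes fbdata
 *   pathcc -O3 -fb_opt fbdata fdo.c -o app      # feedback-optimized rebuild
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    long n = (argc > 1) ? atol(argv[1]) : 1000000L;
    long hits = 0, i;
    /* A mostly-taken branch: profile data lets the compiler arrange
       the hot path contiguously. */
    for (i = 0; i < n; i++)
        if ((i * 2654435761UL) % 7 != 0)
            hits++;
    printf("hits = %ld\n", hits);
    return 0;
}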

ACML 2.1
 Features
 BLAS, LAPACK, and FFT performance
 OpenMP performance
 ACML 2.5 snapshot (soon to be released)

Components of ACML: BLAS, LAPACK, FFTs
 Linear Algebra (LA)
   Basic Linear Algebra Subroutines (BLAS)
     Level 1 (vector-vector operations)
     Level 2 (matrix-vector operations)
     Level 3 (matrix-matrix operations)
     Routines involving sparse vectors
   Linear Algebra PACKage (LAPACK)
     Leverages BLAS to perform complex operations
     28 threaded LAPACK routines
 Fast Fourier Transforms (FFTs)
   1D and 2D; single and double precision; real-to-real, real-to-complex, complex-to-real, and complex-to-complex support
 C and Fortran interfaces (see the DGEMM example below)
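A minimal sketch of calling ACML's DGEMM (Level 3 BLAS) through the C interface. The header name, the by-value argument convention, and the link line are stated from memory and should be treated as assumptions:

/*
 * Assumed build line:  gcc dgemm_test.c -lacml -lm -o dgemm_test
 * Computes C := A*B for 2x2 matrices stored column-major, as BLAS expects.
 */
#include <stdio.h>
#include <acml.h>   /* assumed ACML C-interface header */

int main(void) {
    double a[4] = {1.0, 3.0, 2.0, 4.0};  /* A = [1 2; 3 4], column-major */
    double b[4] = {5.0, 7.0, 6.0, 8.0};  /* B = [5 6; 7 8] */
    double c[4] = {0.0, 0.0, 0.0, 0.0};

    /* dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc) */
    dgemm('N', 'N', 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);

    /* Expected output: C = [19 22; 43 50] */
    printf("C = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]);
    return 0;
}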

64-bit BLAS Performance: DGEMM (Double-Precision General Matrix Multiply)

64-bit FFT Performance (non-power-of-2 sizes): MKL vs. ACML on a 2.2 GHz Opteron

64-bit FFT Performance (non-power-of-2 sizes): 2.2 GHz Opteron vs. 3.2 GHz Xeon EM64T

Multithreaded LAPACK Performance: Double Precision (LU, Cholesky, QR Factorize/Solve)

Conclusion and Closing Points
 How good is our performance?
   Averaging over 70 BLAS/LAPACK/FFT routines (a computation-weighted average), with all measurements performed on a 4P AMD Opteron 844 quartet server:
   ACML 32-bit is 55% faster than MKL 6.1
   ACML 64-bit is 80% faster than MKL 6.1

64-bit ACML 2.5 Snapshot: Small DGEMM Enhancements

Benchmark Suites: Recent CASPUR Results (thanks to M. Rosati)
 ATLSIM: a full-scale GEANT3 simulation of the ATLAS detector (P. Nevski), run on typical LHC Higgs events
 SixTrack: tracking of two particles in a 6-dimensional phase space, including synchrotron oscillations (F. Schmidt); SixTrack benchmark code by E. McIntosh
 CERN U: the ancient "CERN Units" benchmark (E. McIntosh)

What Was Measured
On both platforms we ran one or two simultaneous jobs for each of the benchmarks. On Opteron, we used the SuSE numactl interface to make sure that at any time each of the two processors makes use of the right bank of memory.
Example of submission, 2 simultaneous jobs:
Intel: ./TestJob ; ./TestJob
AMD: numactl --cpubind=0 --membind=0 ./TestJob ; numactl --cpubind=1 --membind=1 ./TestJob

Results
CERN Units, SixTrack (seconds/run), and ATLSIM (seconds/event), each measured with 1 job and with 2 simultaneous jobs, on Intel Nocona and AMD Opteron.
While both machines behave in a similar way when only one job is run, the situation changes in a visible manner in the case of two jobs. It may take up to 30% more time to run two simultaneous jobs on Intel, while on AMD there is a notable absence of any visible performance drop.

HEP Software Bench

An Original MPI Work on AMD Opteron
 We got access to the MPI wrapper-library source
 Environment:
   4-way servers
   Myrinet interconnect
   Linux 2.6 kernel
   LAM MPI
 We inserted libnuma calls after MPI_INIT to bind the newly created MPI tasks to specific processors (sketched below)
   We avoid unnecessary memory traffic by having each processor access its own memory
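A minimal sketch of what such a binding step could look like, assuming libnuma (link with -lnuma) and a simple rank-to-node mapping; this is an illustration of the idea, not the original wrapper-library code:

/*
 * Bind each MPI task to a NUMA node right after MPI_Init so that its
 * CPU placement and memory allocations stay local to that node.
 */
#include <stdio.h>
#include <mpi.h>
#include <numa.h>

int main(int argc, char **argv) {
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numa_available() != -1) {
        /* Illustrative mapping: rank i runs on node i (mod node count). */
        int node = rank % (numa_max_node() + 1);
        numa_run_on_node(node);   /* pin this task's CPU placement */
        numa_set_localalloc();    /* allocate memory on the local node */
    }

    /* ... application work: allocations now come from local memory ... */

    MPI_Finalize();
    return 0;
}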

>20% improvement

Conclusions
AMD Opteron: HPEW (High Performance, the Easy Way)