The Jacquard Programming Environment Mike Stewart NUG User Training, 10/3/05

2 Outline Compiling and Linking. Optimization. Libraries. Debugging. Porting from Seaborg and other systems.

3 Pathscale Compilers
 - Default compilers: Pathscale Fortran 90, C, and C++.
 - The “path” module is loaded by default and points to the current default version of the Pathscale compilers (currently 2.2.1).
 - Other versions available: module avail path.
 - Extensive vendor documentation is available on-line.
 - Commercial product: well supported and optimized.
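For example, the module commands for checking and selecting a compiler version might look like the following (the specific version string is illustrative; module avail path shows what is actually installed):

  module avail path            # list the installed Pathscale compiler versions
  module swap path path/2.2.1  # load a specific version (2.2.1 is the current default)
  module list                  # confirm which path module is loaded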

4 Compiling Code
 - Compiler invocation:
   - No MPI: pathf90, pathcc, pathCC.
   - MPI: mpif90, mpicc, mpicxx.
 - The MPI compiler invocations use the currently loaded compiler version.
 - The MPI and non-MPI compiler invocations take the same options and arguments.
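As a sketch (the source file names are hypothetical), a serial build and an MPI build differ only in the wrapper that is invoked; the MPI wrapper supplies the MPI include and library paths:

  pathf90 -o hello hello.f90           # serial Fortran 90 compile and link
  mpif90 -o mpi_hello mpi_hello.f90    # same options, MPI wrapper around the loaded Pathscale compiler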

5 Compiler Optimization Options
 - 4 numeric levels: -On, where n ranges from 0 (no optimization) to 3.
 - Default level: -O2 (unlike the IBM compilers on Seaborg).
 - -g without a -O option changes the default to -O0.
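A minimal illustration of the -g behavior (program name hypothetical):

  pathf90 prog.f90           # no options: compiled at the default -O2
  pathf90 -g prog.f90        # -g alone: the default drops to -O0
  pathf90 -g -O2 prog.f90    # -g with an explicit -O option: -O2 is kept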

6 -O1 Optimization
 - Minimal impact on compilation time compared to an -O0 compile.
 - Only optimizations that apply to straight-line code (basic blocks), such as instruction scheduling.

7 -O2 Optimization
 - Default when no optimization arguments given.
 - Optimizations that always increase performance.
 - Can significantly increase compilation time.
 - -O2 optimization examples:
   - Loop nest optimization.
   - Global optimization within a function scope.
   - 2 passes of instruction scheduling.
   - Dead code elimination.
   - Global register allocation.

8 -O3 Optimization
 - More extensive optimizations that may in some cases slow down performance.
 - Optimizes whole loop nests rather than just inner loops, e.g. interchanging loop indices.
 - “Safe” optimizations: produces answers identical to those produced by -O0.
 - -O3 is the NERSC recommendation, based on experience with the benchmarks that follow.
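As a hand-written illustration (not compiler output) of the kind of loop-nest restructuring this level enables, consider interchanging a nest so that the innermost loop runs with unit stride through Fortran's column-major arrays:

  program interchange_demo
    implicit none
    integer, parameter :: n = 1000
    real(8) :: a(n,n), c(n,n)
    integer :: i, j
    a = 1.0d0
    c = 0.0d0
    ! Original nest: j innermost, so a(i,j) and c(i,j) are accessed with
    ! stride n, which is poor cache behavior for column-major arrays.
    do i = 1, n
       do j = 1, n
          c(i,j) = c(i,j) + a(i,j)
       end do
    end do
    ! Interchanged nest (the sort of transformation -O3 loop nest
    ! optimization can apply): i innermost gives unit-stride access.
    do j = 1, n
       do i = 1, n
          c(i,j) = c(i,j) + a(i,j)
       end do
    end do
    print *, c(1,1)
  end program interchange_demo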

9 -Ofast Optimization
 - Equivalent to -O3 -ipa -fno-math-errno -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed.
 - -ipa: interprocedural analysis.
   - Optimizes across function boundaries.
   - Must be specified at both compile and link time (see the sketch below).
 - Aggressive “unsafe” optimizations:
   - Changes the order of evaluation.
   - Deviates from the IEEE 754 standard to obtain better performance.
 - There are some known problems with this level of optimization in the current release.
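A sketch of a separate-compile build with -Ofast (file names hypothetical). Because -ipa is part of -Ofast, the flag must appear on the link line as well as on each compile line:

  pathf90 -Ofast -c solver.f90
  pathf90 -Ofast -c driver.f90
  pathf90 -Ofast solver.o driver.o -o prog   # link step: -ipa (via -Ofast) is needed here too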

10 NAS B Serial Benchmarks Performance (MOP/S): table comparing the best Seaborg result against Jacquard at -O0, -O1, -O2, -O3, and -Ofast for the BT, CG, EP, FT, IS, LU, MG, and SP benchmarks (FT did not compile at one of the levels; the numeric values were not preserved in this transcript).

11 NAS B Serial Benchmarks Compile Times (seconds): table of compile times at -O0, -O1, -O2, -O3, and -Ofast for the BT, CG, EP, FT, IS, LU, MG, and SP benchmarks (FT did not compile at one of the levels; the numeric values were not preserved in this transcript).

12 NAS B Optimization Arguments Used by LNXI Benchmarkers
 - BT: -O3 -ipa -WOPT:aggstr=off
 - CG: -O3 -ipa -CG:use_movlpd=on -CG:movnti=1
 - EP: -LNO:fission=2 -O3 -LNO:vintr=2
 - FT: -O3 -LNO:opt=0
 - IS: -Ofast -DUSE_BUCKETS
 - LU: -Ofast -LNO:fusion=2:prefetch=0:full_unroll=10:ou_max=5 -OPT:ro=3:fold_unsafe_relops=on:fold_unsigned_relops=on:unroll_size=256:unroll_times_max=16:fast_complex -CG:cflow=off:p2align_freq=1 -fno-exceptions
 - MG: -O3 -ipa -WOPT:aggstr=off -CG:movnti=0
 - SP: -Ofast

13 NAS C FT (32 Proc): table of Mops/Proc and compile time (seconds) for the best Seaborg result (86.5 Mops/Proc, compile time N/A) and for Jacquard at -O0 through -Ofast (the Jacquard values were not preserved in this transcript).

14 SuperLU MPI Benchmark
 - Based on the SuperLU general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations.
 - Mostly C with some Fortran 90 routines.
 - Run on 64 processors / 32 nodes.
 - Uses BLAS routines from ACML.

15 SLU (64 procs): table of elapsed run time and compile time (seconds) for the best Seaborg result (742.5 s elapsed, compile time N/A) and for Jacquard at -O0 through -O3 (values not preserved in this transcript); at -Ofast the benchmark did not compile.

16 Jacquard Applications Acceptance Benchmarks
 - NAMD (32 proc): Seaborg 2384 sec, Jacquard 554 sec; Jacquard optimizations: -O3 -ipa -fno-exceptions
 - Chombo Serial: Seaborg 1036 sec, Jacquard 138 sec; Jacquard optimizations: -O3 -OPT:Ofast -OPT:Olimit= -fno-math-errno -finline
 - Chombo Parallel (32 proc): Seaborg 773 sec, Jacquard 161 sec; Jacquard optimizations: -O3 -OPT:Ofast -OPT:Olimit= -fno-math-errno -finline
 - CAM Serial: Seaborg (value not preserved) sec, Jacquard 264 sec; Jacquard optimizations: -O2
 - CAM Parallel (32 proc): Seaborg 75 sec, Jacquard 13.2 sec; Jacquard optimizations: -O2
 - SuperLU (64 proc): Seaborg (value not preserved) sec, Jacquard 212 sec; Jacquard optimizations: -O3 -OPT:Ofast -fno-math-errno
(The Olimit value and two Seaborg times did not survive transcription.)

17 ACML Library
 - AMD Core Math Library: a set of numerical routines tuned specifically for AMD64 platform processors.
   - BLAS
   - LAPACK
   - FFT
 - To use with Pathscale:
   - module load acml (built with the Pathscale compilers)
   - Compile and link with $ACML
 - To use with gcc:
   - module load acml_gcc (built with the gcc compilers)
   - Compile and link with $ACML
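A sketch of linking against ACML with the Pathscale compiler (program name hypothetical; $ACML is the link string provided by the acml module, as described above):

  module load acml
  pathf90 -O3 -o solver solver.f90 $ACML   # $ACML expands to the ACML link options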

18 Matrix Multiply Optimization Example
 - 3 ways to multiply 2 dense matrices:
   - Directly in Fortran with nested loops
   - The matmul F90 intrinsic
   - dgemm from ACML
 - Example: 1000 by 1000 double precision matrices.
 - Order of indices: ijk means
     do i=1,n
       do j=1,n
         do k=1,n
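A minimal sketch of the three variants (a hand-written illustration, not the actual benchmark code); the dgemm call uses the standard BLAS interface, and the program would be linked with $ACML:

  program matmul_demo
    implicit none
    integer, parameter :: n = 1000
    real(8) :: a(n,n), b(n,n), c(n,n)
    integer :: i, j, k
    call random_number(a)
    call random_number(b)

    ! 1. Direct triple loop, ijk ordering (the other orderings permute these loops).
    c = 0.0d0
    do i = 1, n
       do j = 1, n
          do k = 1, n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do

    ! 2. The Fortran 90 matmul intrinsic.
    c = matmul(a, b)

    ! 3. dgemm from ACML: c = 1.0*A*B + 0.0*C.
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end program matmul_demo

Built, for example, with "module load acml" followed by "pathf90 -O3 matmul_demo.f90 $ACML".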

19 Fortran Matrix Multiply MFLOPs: table of MFLOP rates for each loop ordering (ijk, jik, ikj, kij, jki, kji), the matmul intrinsic, and dgemm, comparing the best Seaborg result against Jacquard at -O0, -O1, -O2, -O3, and -Ofast (numeric values not preserved in this transcript).

20 Debugging
 - The Etnus TotalView debugger has been installed on the system.
 - It is still in testing mode, but should be available to users soon.

21 Porting codes
 - Jacquard is a Linux system, so GNU tools like gmake are the defaults.
 - The Pathscale compilers are good but new, so please report any apparent compiler bugs to the NERSC consultants.