Performance Analysis on Blue Gene/P Tulin Kaman Department of Applied Mathematics and Statistics Stony Brook University

From the microprocessor to the full Blue Gene/P system

IBM XL Compilers
Commands to invoke the compiler:
• on BG/L: blrts_xlc, blrts_xlC, blrts_xlf, ...
• on BG/P: bgxlc, bgxlc++, bgxlf, ...
How to compile MPI programs on BG/P
Option 1: Using the bgxl-prefixed commands, the programmer explicitly identifies all the libraries and include files:
-L/bgsys/drivers/ppcfloor/comm/lib -lmpich.cnk -ldcmfcoll.cnk -ldcmf.cnk -lpthread -lrt -L/bgsys/drivers/ppcfloor/comm/runtime/SPI -lSPI.cna
Option 2: Use the wrapper scripts mpixlc, mpixlcxx, mpixlf, ... These scripts take care of the proper order in which the libraries must be linked.
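A minimal sketch (the source file name is hypothetical) of an MPI program built with the Option 2 wrapper:

/* hello_mpi.c - minimal MPI example (hypothetical file name).      */
/* Assumed build command on BG/P, using the Option 2 wrapper:       */
/*     mpixlc -o hello_mpi hello_mpi.c                              */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("Task %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}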

Shared-memory parallelism
• The BG/P system supports shared-memory parallelism on single nodes.
• The compiler can automatically locate and parallelize a countable loop if:
• The order in which loop iterations start and end doesn't affect the result
• The loop doesn't contain I/O operations
• The code is compiled with a thread-safe version of the compiler (_r suffix): mpixlc_r, mpixlcxx_r, mpixlf77_r, ...
• The -qsmp=auto option is in effect.
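A minimal sketch of such a loop (file name, array size, and build line are assumptions); every iteration is independent and performs no I/O, so -qsmp=auto may parallelize it automatically:

/* axpy.c - candidate loop for automatic parallelization (hypothetical example) */
/* Assumed build:  mpixlc_r -qsmp=auto -O3 -o axpy axpy.c                        */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    double a = 2.0;
    int i;

    for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Countable loop: trip count known on entry, iterations independent, no I/O */
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}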

OpenMP pragmas
-qsmp=omp parallelizes based on the OpenMP directives in the code.
#pragma omp parallel private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) nthreads = omp_get_num_threads();
  printf("Thread %d starting...\n", tid);
  #pragma omp for
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
} /* end of parallel section */
• On entering the parallel region, the master thread creates additional threads; the work-shared for loop is executed by all threads.
• Variables are shared by default.

Compiler optimization options
For a specific architecture:
• BG/L: two 32-bit PowerPC 440 cores per node
• BG/P: four 32-bit PowerPC 450 cores per node
-qarch: generates instructions for the target processor; -qarch=450d generates parallel instructions for the double-precision floating-point multiply-add unit (Double Hummer), while -qarch=450 uses a single FPU only.
-qtune: optimizes the object code for the target processor.
Example: -O3 -qarch=450d -qtune=450
To list all options in effect, specify -qlistopt.

Mathematical Acceleration Subsystem (MASS) Libraries
• The MASS libraries offer improved performance over the standard mathematical library routines and are thread-safe.
• MASS Libraries for Blue Gene v4.4 [*] are installed in the $home directory.
• The library is specified explicitly on the linker command line:
-L$home/mass/4.4/bgp/lib -lmass
*
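As a sketch (file name and loop bounds are assumptions), a code that calls standard libm routines can pick up the tuned MASS versions simply by adding the link options above; linking libmass ahead of libm substitutes the MASS implementation of exp():

/* mass_example.c - calls a standard math routine; when the MASS library is     */
/* linked ahead of libm, the tuned MASS exp() is used (hypothetical file name).  */
/* Assumed build:  bgxlc -O3 mass_example.c -L$home/mass/4.4/bgp/lib -lmass -lm  */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        sum += exp(-1.0e-6 * i);
    printf("sum = %f\n", sum);
    return 0;
}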

EXECUTION PROCESS MODES
Symmetrical Multiprocessing (SMP) Node Mode
• Each compute node executes a single MPI task with a maximum of 4 threads.
Virtual Node (VN) Mode
• In this mode, the kernel runs four MPI tasks on each compute node.
Dual Node (DUAL) Mode
• Each compute node executes two MPI tasks, with two threads per task; each task gets half of the node's memory.

Specify the mode in the script file:
SMP (1x4): -mode smp
VN (4x1): -mode vn
DUAL (2x2): -mode dual
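A minimal sketch (hypothetical file name) that reports the number of MPI tasks and OpenMP threads, which can be used to confirm the layout selected by the -mode flag:

/* mode_check.c - prints task and thread counts so the chosen mode can be verified */
/* Assumed build:  mpixlc_r -qsmp=omp -o mode_check mode_check.c                    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    if (rank == 0)
        printf("MPI tasks: %d, OpenMP threads per task: %d\n",
               ntasks, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}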

TOTAL PERFORMANCE
Computation: Xprofiler, HPM
Communication: MPI Profiler
I/O: MIO Library

MPI Profiler and Tracer
• Compiling and linking
• The application must be compiled with the -g option.
• To link the application with the trace library, add these options to the link command line: -L -lmpitrace -llicense
• -lmpitrace must appear before -lmpich.
• Set the environment variables.
• Outputs
• Plain text file (mpi_profile.0)
• The VIZ file (mpi_profile_0.viz)
• Trace file (single_trace)

Output I: Plain text file
• We get profile data for rank 0 and for the ranks with the minimum, maximum, and median MPI communication time.
MPI Routine          #calls    avg. bytes    time(sec)
MPI_Comm_size
MPI_Comm_rank
MPI_Bsend
MPI_Recv
MPI_Buffer_attach
MPI_Barrier
MPI_Allgather
MPI_Allreduce
total communication time = seconds.
total elapsed time = seconds.

• mpi_profile.0 contains timing summaries from all tasks.
Communication summary for all tasks:
minimum communication time = sec for task 351
median communication time = sec for task 57
maximum communication time = sec for task 7
taskid  xcoord  ycoord  zcoord  procid  total_comm(sec)  avg_hops

Output II: The VIZ file, Output III: The trace file
Using the IBM HPCT graphical user interface (GUI), we can view the profile data presented in the form of an XML file.
• The VIZ file: peekperf mpi_profile_0.viz
• The trace file: peekview single_trace

Xprofiler - CPU profiling tool
• Xprofiler provides quick access to the profiled data and helps users identify the functions that are the most CPU-intensive.
• Using the GUI, it is very easy to find the application's performance-critical areas.
• Full profiling - call graph information, statement-level profiling, basic-block profiling, and machine-instruction profiling: add -pg -g to the compile and link options.
• Usage: Xprofiler gmon.out.taskid

Xprofiler The size of green, solid-filled boxes indicates CPU usage. The height of the box: time spent on executing itself. The width of the box: time spent on executing itself and descendent functions. The number on arrows: call counts.

Hardware Performance Monitoring (HPM)
• Provides comprehensive reports of hardware events that are critical to performance on the BG/P system.
• Hardware performance metrics:
• the number of misses on all cache levels
• the number of floating-point instructions executed
• Libhpm is the library that provides a programming interface to start, stop, and print the performance counters.

Libhpm Library
1. Initialize the Performance Counter unit
2. Start counting in a block marked by the label
3. Stop counting in a block marked by the label
4. Print counters for all blocks
C                           Fortran
HPM_Init(void);             call hpm_init()
HPM_Start("WorkC");         call hpm_start('WorkF')
do_work;                    do_work
HPM_Stop("WorkC");          call hpm_stop('WorkF')
HPM_Print();                call hpm_print()
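A minimal C sketch of this sequence (the work loop is an assumption, and the declarations are written out by hand here; the real prototypes and the libhpm compile/link flags come with the IBM HPC Toolkit installation):

/* hpm_block.c - instruments one code block with the libhpm calls listed above */
#include <stdio.h>

void HPM_Init(void);                 /* hand-written declarations matching the  */
void HPM_Start(const char *label);   /* calls on the slide; in practice include */
void HPM_Stop(const char *label);    /* the header shipped with libhpm          */
void HPM_Print(void);

int main(void)
{
    double sum = 0.0;
    int i;

    HPM_Init();                      /* 1. initialize the performance counter unit */
    HPM_Start("WorkC");              /* 2. start counting for the block "WorkC"    */
    for (i = 0; i < 1000000; i++)    /*    the instrumented work                   */
        sum += (double)i;
    HPM_Stop("WorkC");               /* 3. stop counting for the block "WorkC"     */
    HPM_Print();                     /* 4. print counters for all blocks           */

    printf("sum = %f\n", sum);
    return 0;
}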

hpm_data.taskid
Hardware counter report for BGP node 0, coordinates.
BGP counter mode = 0, trigger = level high
Work1, call count = 548, cycles = :
BGP_PU0_JPIPE_INSTRUCTIONS
BGP_PU0_JPIPE_ADD_SUB
BGP_PU0_JPIPE_LOGICAL_OPS
BGP_PU0_JPIPE_SHROTMK
…
BGP_PU0_DCACHE_LINEFILLINPROG
BGP_PU0_ICACHE_LINEFILLINPROG
BGP_PU0_DCACHE_MISS
BGP_PU0_DCACHE_HIT
BGP_PU0_ICACHE_MISS
BGP_PU0_ICACHE_HIT 22 0
BGP_PU0_FPU_ADD_SUB_1
BGP_PU0_FPU_MULT_1
…

Tuning and Analysis Utilities (TAU)

Performance Analysis Tool TAU provides a suite of static and dynamic tools that provide graphical user interaction and interoperation to form an integrated analysis environment for parallel applications.

Some definitions
Scaling studies involve changing the degree of parallelism.
• Strong Scaling
• Weak Scaling
Speedup = T_s / T_p(n)
Efficiency = T_s / (n * T_p(n))
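For illustration (numbers hypothetical): if the serial run takes T_s = 100 s and the run on n = 16 processors takes T_p(16) = 8 s, then Speedup = 100 / 8 = 12.5 and Efficiency = 100 / (16 * 8) ≈ 0.78.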

Strong scaling: fixed problem size, more processors

Weak scaling: problem size grows with the number of processors

New York Blue
BG/L System: 18 racks, 18432 compute nodes (36864 CPUs)
BG/P System: 2 racks, 2048 compute nodes (8192 CPUs)

Argonne Leadership Computing Facility (ALCF)
BG/P Surveyor System: 13.6 TF/s, 1 BG/P rack, 1024 compute nodes (4096 CPUs)
BG/P Intrepid System: 557 TF/s, 40 BG/P racks, 40960 compute nodes (163840 CPUs)

IBM Redbooks:
• IBM System Blue Gene Solution: Blue Gene/P Application Development
• IBM System Blue Gene Solution: High Performance Computing Toolkit for Blue Gene/P
• IBM XL C/C++ Advanced Edition for Linux, V9.0 Compiler Reference
Other resources:
• Advanced Computing Technology Center
• New York Blue
• Argonne Leadership Computing Facility

Leap to Petascale Workshop ALCF & Blue Gene July