ARCHER Tips and Tricks: A few notes from the CSE team

Reusing this material
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means you are free to copy and redistribute the material and adapt and build on the material under the following terms:
- You must give appropriate credit, provide a link to the license and indicate if changes were made.
- If you adapt or build on the material you must distribute your work under the same license as the original.
Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

Outline
- Using Intel MKL
- Impact of HyperThreads
- Showing process/thread placement
- Performance analysis: hardware counters on ARCHER
- Debugging: disabling autotuning in Cray BLAS; enabling and using ATP

Intel MKL
- MKL can be used as an alternative to LibSci
- We have seen cases where either is better, so it is worth experimenting
- Not interfaced through modules
- Linking using GNU:
  -L$(MKLROOT)/lib/intel64/ -Wl,--start-group -lmkl_sequential \
  -lmkl_gf_lp64 -lmkl_core -Wl,--end-group -ldl
- Linking using Intel:
  -L$(MKLROOT)/lib/intel64/ -Wl,--start-group -lmkl_intel_lp64 \
  -lmkl_core -lmkl_sequential -Wl,--end-group -ldl
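As a rough illustration, the GNU link line above could be used from the Cray compiler wrappers like this (a minimal sketch: the program and source file names are hypothetical, PrgEnv-gnu is assumed to be loaded, and MKLROOT is assumed to already be set in the environment):

  # Hypothetical build step on an ARCHER login node
  MKL_LIBS="-L${MKLROOT}/lib/intel64/ -Wl,--start-group -lmkl_sequential"
  MKL_LIBS="${MKL_LIBS} -lmkl_gf_lp64 -lmkl_core -Wl,--end-group -ldl"
  ftn -o myprog myprog.f90 ${MKL_LIBS}   # ftn is the Cray Fortran wrapper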

Intel MKL (cont.)
The link line is reasonably complicated, so use the MKL Link Line Advisor:
For ARCHER select:
- Product: Intel Composer XE 2013 SP1
- OS: Linux
- Usage model for Coprocessor: None
- Architecture: Intel(R) 64
- Linking: Static
- Interface Layer: LP64 (32-bit Integer)
- (MPI: MPICH2, if required)

Impact of HyperThreads
- HyperThreads allow up to 2 processes/threads to run concurrently on a single physical core
- Managed in hardware, so context switches are fast
- Uses CPU resources while one thread is stalled
- The benefit is very program dependent, but even a small improvement is worth it (as it is free), so it is worth testing whether it helps your program
- aprun syntax (2 nodes): aprun -j 2 -n 96 -N 48 ...
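A sketch of how the two cases might look in a job script (myprog is a hypothetical executable; ARCHER nodes have 24 physical cores, so 2 nodes give 48 physical cores or 96 hyperthreads):

  # Without hyperthreading: one rank per physical core on 2 nodes
  aprun -n 48 -N 24 ./myprog

  # With hyperthreading: two ranks per physical core on the same 2 nodes
  aprun -j 2 -n 96 -N 48 ./myprog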

Hyperthreading example performance
XC30: Sandy Bridge (8 cores), fully populated nodes
[Charts from "Effects of Hyper-Threading on the NERSC workload on Edison", showing NAMD and VASP performance]

Show Process/Thread Placement
- Process/thread placement can have a large impact on performance, particularly when underpopulating nodes or running mixed-mode (MPI/OpenMP) code
- Add the following lines to your job submission script:
  export MPICH_CPUMASK_DISPLAY=1
  export MPICH_RANK_REORDER_DISPLAY=1
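For example, a job script fragment along these lines (the executable name and aprun options are illustrative) produces the rank-to-node mapping and CPU masks shown on the next two slides:

  # Report placement and binding in the job output
  export MPICH_CPUMASK_DISPLAY=1
  export MPICH_RANK_REORDER_DISPLAY=1
  aprun -n 48 -N 24 ./myprog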

Placement (cont.)
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02421
[PE_0]: rank 1 is on nid02421
[PE_0]: rank 2 is on nid02421
...
[PE_0]: rank 24 is on nid02505
[PE_0]: rank 25 is on nid02505
[PE_0]: rank 26 is on nid02505
...

Placement (cont.)
[PE_0]: cpumask set to 1 cpu on nid02421, cpumask =
[PE_34]: cpumask set to 1 cpu on nid02505, cpumask =
[PE_33]: cpumask set to 1 cpu on nid02505, cpumask =
[PE_35]: cpumask set to 1 cpu on nid02505, cpumask =
[PE_47]: cpumask set to 1 cpu on nid02505, cpumask =
...

Hardware Counters on ARCHER
- CrayPAT allows you to monitor performance at the hardware level
- Specify the set of performance counters using the PAT_RT_PERFCTR environment variable in the script that runs the instrumented code:
  PAT_RT_PERFCTR=1
  (Group 1 shows a summary with floating-point and cache metrics.)
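A minimal sketch of the full workflow, assuming the perftools module provides pat_build and pat_report (the module name, build command, and executable name here are illustrative):

  # On the login node: build, then instrument the executable with CrayPAT
  module load perftools
  ftn -o myprog myprog.f90        # hypothetical build step
  pat_build myprog                # produces an instrumented binary, myprog+pat

  # In the job submission script: select counter group 1 and run
  export PAT_RT_PERFCTR=1
  aprun -n 48 -N 24 ./myprog+pat

  # After the job: summarise the recorded counters (the data file name varies)
  pat_report myprog+pat+*.xf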

=================================================================
Total
  PERF_COUNT_HW_CACHE_L1D:ACCESS
  PERF_COUNT_HW_CACHE_L1D:PREFETCH
  PERF_COUNT_HW_CACHE_L1D:MISS
  CPU_CLK_UNHALTED:THREAD_P
  CPU_CLK_UNHALTED:REF_P
  DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK
  DTLB_STORE_MISSES:MISS_CAUSES_A_WALK
  L2_RQSTS:ALL_DEMAND_DATA_RD
  L2_RQSTS:DEMAND_DATA_RD_HIT
  User time (approx)             secs        cycles
  CPU_CLK                        2.962GHz
  TLB utilization                refs/miss   avg uses
  D1 cache hit,miss ratios       94.8% hits  5.2% misses
  D1 cache utilization (misses)  refs/miss   avg hits
  D2 cache hit,miss ratio        87.8% hits  12.2% misses
  D1+D2 cache hit,miss ratio     99.4% hits  0.6% misses
  D1+D2 cache utilization        refs/miss   avg hits
  D2 to D1 bandwidth             MB/sec      bytes

Disable Cray BLAS autotuning
- If you are debugging and use the Cray LibSci library, you may want to disable autotuning, to ensure that autotuning is not causing the error
- Add: CRAYBLAS_AUTOTUNING_OFF=1 to your job scripts
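For instance, a job script might contain something like the following (the aprun line and executable name are illustrative):

  export CRAYBLAS_AUTOTUNING_OFF=1   # disable LibSci BLAS autotuning while debugging
  aprun -n 48 -N 24 ./myprog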

Using ATP
- ATP (Abnormal Termination Processing) catches dying applications and produces a merged stack backtrace
- Useful for getting more information on crashes
- Set: ATP_ENABLED=1 in your job submission script
- There is no need to recompile to use ATP
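A sketch of the relevant job script lines (the ulimit line is an assumption, to allow core files to be written; myprog is a hypothetical executable):

  export ATP_ENABLED=1       # let ATP intercept abnormal termination
  ulimit -c unlimited        # assumption: permit core files so ATP can collect them
  aprun -n 48 -N 24 ./myprog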

Using ATP (cont.)
When your program crashes, ATP will:
- Produce a stack trace of the first failing process
- Produce a visualisation of every process's stack trace
- Generate a selection of relevant core files
Visualise the merged stack trace using statview:
  module add stat
  statview atpMergedBT.dot
A very simple way to start the debugging process

statview (thanks to Cray)