PAPI 3.0.8.1 on Blue Gene L: Using network performance counters to lay out tasks for improved performance

Presentation transcript:

PAPI on Blue Gene L: Using network performance counters to lay out tasks for improved performance

Presentation overview
- Project objectives
- PAPI explanation
- Blue Gene L explanation
- Current state of research

Project objectives
- Upgrade PAPI on BG/L
- Provide an interface for network counters
- Give Lawrence Livermore National Lab users access to PAPI as well
- Use network counters to place tasks optimally on BG/L

PAPI – Intro Courtesy of

PAPI – Intro
- PAPI is useful for profiling your own programs
- Many tools are based on PAPI
  - PapiEx – Command line measurement tool
  - PerfSuite – Aggregate measurement and statistical profiling package and API
  - HPCToolkit – Statistical profiling package
  - Many more!

PAPI – Supported platforms
- IBM – POWER3, 604, 604e, POWER4
- Cray T3E, Cray X1
- AMD – Athlon, Opteron
- Intel – P1 to P4, Itanium I and II
- UltraSparc I, II & III
- MIPS R10K, R12K, R14K
- Alpha

PAPI – Generic Interface
Call sequence for generic interface
- PAPI_library_init – Initialize memory for PAPI’s data structures
- PAPI_create_eventset – Create an empty list of events
- PAPI_add_event – Add events to be counted
- PAPI_start – Begin counting all events within the specified eventset
- PAPI_stop – Stop all counters and read their current values

PAPI – Events: Presets
- Presets – list of predefined events implemented on all systems where they can be supported
- Not all presets are available on every architecture (e.g. BG/L has no cache lower than L3, thus the L1 cache hit preset is not applicable)
- Native events form the basic building blocks for PAPI presets

PAPI – Events: Presets Courtesy of

PAPI – Events: Native
- In addition to the predefined PAPI preset events, the PAPI library also exposes a majority of the events native to each platform
- Native events can be added to eventsets in the same manner as presets

PAPI – Events: Native

PAPI – Internals
- The array of eventsets is the main part of PAPI's internal state

PAPI – Other features
- Multiplexing – used when there are not enough hardware counters for all requested events
- Thread safe – profiling is thread safe
- Overflow detection – hardware counters have limited space
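The multiplexing item can be illustrated with a toy model. When more events are requested than there are hardware counters, the events take turns in time slices, and each event's total is estimated by scaling its observed count by the fraction of slices in which it was actually counted. The sketch below is only an illustration of that scaling idea in Python, not PAPI's implementation; the function name, event names, and numbers are all hypothetical.

```python
def estimate_multiplexed_counts(observed, slices_active, total_slices):
    """Scale each event's observed count by total_slices / slices_active.

    observed:      counts seen while the event held a hardware counter
    slices_active: number of time slices in which the event was counted
    total_slices:  number of time slices in the whole measurement
    """
    estimates = {}
    for event, count in observed.items():
        active = slices_active[event]
        if active == 0:
            estimates[event] = None  # never scheduled, so no estimate is possible
        else:
            estimates[event] = count * total_slices / active
    return estimates


# Hypothetical measurement: three events sharing fewer counters over 10 slices.
observed = {"cycles": 500, "fp_ops": 120, "loads": 90}
slices_active = {"cycles": 5, "fp_ops": 3, "loads": 2}
estimates = estimate_multiplexed_counts(observed, slices_active, total_slices=10)
```

The estimate is only as good as the assumption that the program behaves similarly in the slices where an event was not counted, which is why multiplexed counts are less reliable for short or bursty runs.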

PAPI – PAPI2 vs PAPI3 PAPI 3 significantly reduced overheads for starting, stopping and reading the counters Courtesy of

PAPI – PAPI2 vs PAPI3
- Better native event support in PAPI3
- Better thread support in PAPI3
- Overflow and profiling enhancements in PAPI3
- Myriad bug fixes and code cleanup in PAPI3

PAPI – PAPI2 vs PAPI3
- Overlapping eventsets supported in PAPI2
- Minor changes in the API – mostly dereferencing variables

Blue Gene L – Intro
- 65,536 nodes connected in a 64 x 32 x 32 3D torus
- Nodes made up of PowerPC 440 embedded processors
- Smaller than most supercomputers
- Consumes less power

Blue Gene L

Blue Gene L – Networks
- 3D torus network (node to node)
- Tree network (broadcasts)

Blue Gene L – HW counters
- 48 universal performance counters
- 4 floating point unit counters
- Counters are 32 bit – must use virtual counters to prevent overflow
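One common way to build a virtual counter on top of a 32-bit hardware counter is to read the raw value periodically and accumulate the modular difference into a wider software total. The Python sketch below illustrates that idea only; it is not taken from the PAPI or BG/L sources, the class name is hypothetical, and it assumes the counter is sampled at least once per wrap period.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit counter wraps back to 0 after this value


class VirtualCounter:
    """Extend a wrapping 32-bit hardware counter to an unbounded software total."""

    def __init__(self, initial_raw=0):
        self.last_raw = initial_raw & MASK32
        self.total = 0

    def update(self, raw):
        """Fold a new raw reading into the running total and return it."""
        raw &= MASK32
        # The modular difference is correct even if the counter wrapped once
        # since the previous reading.
        delta = (raw - self.last_raw) & MASK32
        self.total += delta
        self.last_raw = raw
        return self.total


# Hypothetical readings that straddle a wrap-around.
vc = VirtualCounter(initial_raw=0xFFFFFFF0)
vc.update(0xFFFFFFFF)  # 15 events since the initial reading
vc.update(0x00000010)  # counter wrapped; 17 more events
```

If the hardware counter can wrap more than once between two readings, the lost wraps cannot be recovered, so the sampling interval has to be chosen with the fastest expected event rate in mind.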

Blue Gene L – HW counters

Research – Overall goals
- Network hardware counters are new
- Use network counters to determine traffic between tasks
- Try to optimize placement of tasks to minimize communication latency
- Given counts and distances: cost = counts * distance; minimize the total over all nodes
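The cost = counts * distance objective can be sketched in a few lines of Python. The slides do not define "distance", so this example assumes it is the hop count on the 64 x 32 x 32 torus, taking the shorter way around each ring; the function names, traffic counts, and placement are hypothetical.

```python
DIMS = (64, 32, 32)  # torus shape from the Blue Gene L intro slide


def torus_distance(a, b, dims=DIMS):
    """Hop count between two nodes, using the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def placement_cost(traffic, placement, dims=DIMS):
    """Sum of counts * distance over all communicating task pairs."""
    return sum(count * torus_distance(placement[src], placement[dst], dims)
               for (src, dst), count in traffic.items())


# Hypothetical message counts between task pairs, and a task -> node mapping.
traffic = {(0, 1): 100, (1, 2): 10, (0, 2): 1}
placement = {0: (0, 0, 0), 1: (63, 0, 0), 2: (0, 16, 5)}
cost = placement_cost(traffic, placement)
```

Because of the wrap-around links, tasks 0 and 1 are one hop apart even though their x coordinates differ by 63, so the heavy 0-1 traffic contributes little to the total.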

Research – Counting
- First goal: determine what is being counted

Research – Networks
- For each MPI call – determine which network counters are being used
- Tree is supposed to be for broadcasts
- Torus is supposed to be for point to point communication
- Ambiguities in the specification

Research – Future decisions
- How to profile a target application
  - Manually insert PAPI instrumentation: a lot of work
  - Instrument binaries with counting code
- What information to store
  - All counts on each node: a lot of data
  - Sample of all nodes: not as accurate (what if the tasks behave or communicate differently?)

Research – Future decisions
- How to use collected information
  - Profile an application to obtain counter feedback and determine an optimized static task layout
  - Dynamically migrate tasks in response to counters
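As a rough illustration of the static-layout option, the Python sketch below takes profiled traffic counts and improves a task placement by pairwise swaps, using the counts * distance cost from the overall-goals slide. The slides do not say which optimization method the project uses; this greedy swap search is one simple assumption, hop count is assumed as the distance, and the small 8 x 1 x 1 torus, task numbers, and function names are hypothetical.

```python
from itertools import combinations


def torus_distance(a, b, dims):
    """Hop count between two nodes, using the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def placement_cost(traffic, placement, dims):
    """Sum of counts * distance over all communicating task pairs."""
    return sum(count * torus_distance(placement[src], placement[dst], dims)
               for (src, dst), count in traffic.items())


def improve_by_swaps(traffic, placement, dims):
    """Swap the nodes of two tasks whenever that lowers the total cost."""
    placement = dict(placement)  # work on a copy
    best = placement_cost(traffic, placement, dims)
    improved = True
    while improved:
        improved = False
        for t1, t2 in combinations(sorted(placement), 2):
            placement[t1], placement[t2] = placement[t2], placement[t1]
            cost = placement_cost(traffic, placement, dims)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo a swap that did not help.
                placement[t1], placement[t2] = placement[t2], placement[t1]
    return placement, best


# Hypothetical profile: two heavily communicating pairs placed far apart.
dims = (8, 1, 1)
traffic = {(0, 1): 50, (2, 3): 50, (0, 2): 1}
start = {0: (0, 0, 0), 1: (4, 0, 0), 2: (2, 0, 0), 3: (6, 0, 0)}
initial_cost = placement_cost(traffic, start, dims)
layout, final_cost = improve_by_swaps(traffic, start, dims)
```

This search only permutes tasks among nodes that are already occupied and can stop at a local minimum, so it is a baseline for reasoning about layouts rather than a method that scales to 65,536 nodes.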