Presentation transcript:

Dissecting On-node Memory Performance with MemAxes
Petascale Tools Workshop 2014, Madison, WI, August 4-7, 2014
Alfredo Gimenez *, Todd Gamblin †, Martin Schulz †, Peer-Timo Bremer †, Barry Rountree †, Abhinav Bhatele †, Ilir Jusufi *, and Bernd Hammann *
† LLNL, * UC Davis
Lawrence Livermore National Laboratory, LLNL-PRES-XXXXXX
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA
Lawrence Livermore National Security, LLC

Memory Access Sampling
Recent hardware additions (Intel PEBS, AMD IBS) allow us to precisely sample events, including memory accesses.
Memory access samples contain:
- The instruction pointer
- The address accessed
- How many core clock cycles elapsed during the access
- Where in the memory hierarchy the address was resolved (e.g. L1 cache, local RAM, remote RAM)
We need a way to meaningfully interpret these samples.
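To make the sample contents concrete, here is a minimal Python sketch of one record and a simple aggregation by memory-hierarchy level. The class name `MemSample`, its field names, and the example values are assumptions for illustration; they are not the PEBS or IBS record layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSample:
    """One memory access sample (hypothetical field names)."""
    ip: int               # instruction pointer
    address: int          # address accessed
    latency_cycles: int   # core clock cycles elapsed during the access
    data_source: str      # where the address was resolved, e.g. "L1"


def latency_by_source(samples):
    """Sum sampled latency per memory-hierarchy level."""
    totals = defaultdict(int)
    for s in samples:
        totals[s.data_source] += s.latency_cycles
    return dict(totals)


# Made-up samples, only to exercise the function.
samples = [
    MemSample(0x400A10, 0x7F0000001000, 4, "L1"),
    MemSample(0x400A10, 0x7F0000002000, 210, "Local RAM"),
    MemSample(0x400B24, 0x7F0000003000, 350, "Remote RAM"),
    MemSample(0x400B24, 0x7F0000001040, 5, "L1"),
]
totals = latency_by_source(samples)
```

Grouping by `data_source` is only one possible cut; the same loop could group by `ip` to attribute latency to instructions.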

Adding Context
Can better understand memory references with appropriate context. Contexts include:
- The code
- The node hardware topology
- Calling context (call path)
- The application (e.g. fluid dynamics)
Slide annotations on this list: "Can get these from tools" and "Need help from app".
Other work by Liu & Mellor-Crummey has looked at mapping latency & access patterns to particular variables and call paths.

We can already get coarse-grained application context for some codes
- Physics data is available in data structures
- Time steps are easy to mark in the code
- Per-process performance is easy to get: turn on counters at the beginning of the run and read them periodically
What if we want finer-grained attribution?
- How to tie measurements to data structures?
- How to slice and dice the data?
[Figure: Aluminum FLOP/s per MPI process]

Node topology is easy to get, but not shown clearly
- PEBS provides metadata for node topology
- Want to highlight connections clearly to show:
  - Load distribution
  - Bandwidth
  - Resource contention
- Existing visualization from hwloc (right):
  - Does not scale
  - Clutters connections between components

We have developed a measurement tool for collecting detailed context
- Use PEBS sampling for hardware information
- Supplement with application instrumentation for mapping addresses to physical coordinates
* SMT (Semantic Memory Tree): data structure used to map sampled instruction operands to callbacks

Currently the developer has to instrument the application manually
Add calls to get metadata for allocated objects:
1. Label string
2. Start and end addresses
3. Size of each element
4. Number of elements
5. Callback to map address to physical coordinates
Metadata must be provided by the programmer
- Could easily be implemented in libraries
- Lots of common mesh libraries would be interesting for this
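The five metadata items can be pictured as one record per allocated object. The Python sketch below is illustrative only: `DataObject`, `register_array`, the exclusive end address, and the 2D layout with x varying fastest are all assumptions, not the tool's actual instrumentation API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DataObject:
    """Metadata for one allocated object (hypothetical structure)."""
    label: str                                    # 1. label string
    start: int                                    # 2. start address
    end: int                                      # 2. end address (assumed exclusive)
    element_size: int                             # 3. size of each element, in bytes
    num_elements: int                             # 4. number of elements
    to_coords: Callable[[int], Tuple[int, ...]]   # 5. address -> physical coordinates


def register_array(label, base, element_size, shape):
    """Describe a 2D array stored contiguously with x varying fastest (assumed)."""
    nx, ny = shape
    count = nx * ny

    def to_coords(address):
        # Convert the byte address to an element index, then to (x, y).
        index = (address - base) // element_size
        return (index % nx, index // nx)

    return DataObject(label, base, base + count * element_size,
                      element_size, count, to_coords)


# Made-up base address and shape, only to exercise the callback.
pressure = register_array("Pressure", 0x1000, 8, (4, 3))
coords = pressure.to_coords(0x1000 + 5 * 8)   # element index 5
```

A mesh library could fill in such a record at allocation time, which is the kind of library-level support the slide suggests.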

Instrumentation
- Specify data objects
- Add additional semantic attributes and define attribution function (optional)

Semantic Memory Tree: Semantic Memory Range Tree (SMRT)
[Figure: instrumentation registers the data buffers (Velocity, Pressure, Temp, Density) as address ranges in a binary search tree; sampled addresses are looked up in the tree so that performance data is recorded in the application domain. Address ranges shown: root 0x0F-0xF6; children 0x0F-0x80 and 0xA2-0xF6; leaves 0x0F-0x20, 0x40-0x80, 0xA2-0xC2, 0xE0-0xF6.]
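The lookup the figure depicts, from a sampled address to the data buffer that contains it, can be sketched as follows. This sketch swaps the slide's binary search tree for a sorted list searched with `bisect`, which gives the same logarithmic lookup for non-overlapping ranges. `RangeIndex` is a hypothetical name, the end addresses are treated as inclusive, and the pairing of buffer labels with the four leaf ranges is assumed from their order in the figure.

```python
import bisect


class RangeIndex:
    """Maps an address to the label of the range containing it."""

    def __init__(self, ranges):
        # ranges: (start, end, label) tuples, end inclusive, non-overlapping
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, address):
        # Find the last range whose start is <= address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, end, label = self._ranges[i]
        return label if address <= end else None


# Leaf ranges from the figure; the label pairing is an assumption.
index = RangeIndex([
    (0x0F, 0x20, "Velocity"),
    (0x40, 0x80, "Pressure"),
    (0xA2, 0xC2, "Temp"),
    (0xE0, 0xF6, "Density"),
])

hit = index.lookup(0x50)    # falls inside 0x40-0x80
miss = index.lookup(0x30)   # falls in the gap between two ranges
```

An address that resolves to a label can then be attributed to that buffer; an address that resolves to `None` belongs to memory the application did not register.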

Lagrangian Hydrodynamics: LULESH
[Figure panels: 2D; 3D; 3D with mapped performance data]

We have developed MemAxes, a tool for analyzing on-node memory performance
- Measurement component samples memory instructions
- We map latency information onto A) source code and B) node topology
- C) Pie chart shows percent of total latency selected
- D) Parallel coordinates view allows exploration of correlations
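The quantity behind the pie chart, the percent of total latency covered by the current selection, is a simple ratio. The Python sketch below assumes samples are `(record, latency_cycles)` pairs and that a selection is a predicate over the record; the function name and data layout are hypothetical and are not taken from MemAxes.

```python
def percent_latency_selected(samples, predicate):
    """Percent of total sampled latency whose record matches the predicate."""
    total = sum(latency for _, latency in samples)
    if total == 0:
        return 0.0
    selected = sum(latency for record, latency in samples if predicate(record))
    return 100.0 * selected / total


# Made-up samples keyed by source line, only to exercise the function.
samples = [
    ({"line": 120}, 300),
    ({"line": 120}, 100),
    ({"line": 88}, 600),
]
share = percent_latency_selected(samples, lambda rec: rec["line"] == 120)
```

The same predicate could select on a topology node or a parallel-coordinates range instead of a source line, which is how linked views can share one selection.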

Linked views clearly show on-node locality problems
- Parallel coordinates view shows correlation between array index and core id in LULESH
- Linked node topology view shows data motion for highlighted memory operations
- A contiguous chunk of an array is initially split between threads on four cores
- Using an optimized affinity scheme, we improve locality
- Performance improved by 10%
[Figure panels: default thread affinity with poor locality; optimized thread affinity with good locality]

Hyperion Thread/Core Binding
- Improved cache usage
- 44% fewer access cycles
- 10% total speedup

Future work
- Back-port perf_events API to production TOSS 2 kernel
  - Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
  - Affects some Intel thread tools as well
- More detailed architecture mapping
  - Sandy Bridge LLC ring interconnect information?
  - Other node architecture features?
- Instrument AMR libraries for proper context attribution
  - Study per-patch memory behavior
  - Study blocking behavior of solvers
- How to query large instruction traces effectively?