
1 Dissecting On-node Memory Performance with MemAxes
Petascale Tools Workshop 2014, Madison, WI, August 4-7, 2014
Alfredo Gimenez *, Todd Gamblin †, Martin Schulz †, Peer-Timo Bremer †, Barry Rountree †, Abhinav Bhatele †, Ilir Jusufi *, and Bernd Hamann *
† LLNL * UC Davis
Lawrence Livermore National Laboratory, LLNL-PRES-657922
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

2 Memory Access Sampling
Recent hardware additions (Intel PEBS, AMD IBS) allow us to precisely sample events, including memory accesses.
Memory access samples contain:
– The instruction pointer
– The address accessed
– How many core clock cycles elapsed during the access
– Where in the memory hierarchy the address was resolved (e.g. L1 cache, local RAM, remote RAM)
We need a way to meaningfully interpret these samples.
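As an illustration of the sample contents listed above, here is a minimal Python sketch of one sample record and a first-pass summary. The names `MemorySample`, `DataSource`, and `mean_latency_by_source` are hypothetical, chosen for this example; they are not the PEBS or IBS record layout, nor the MemAxes API.

```python
from dataclasses import dataclass
from enum import Enum


class DataSource(Enum):
    """Where in the memory hierarchy an access was resolved (illustrative subset)."""
    L1 = "L1 cache"
    LOCAL_RAM = "local RAM"
    REMOTE_RAM = "remote RAM"


@dataclass(frozen=True)
class MemorySample:
    """One memory access sample; field names are assumptions for this sketch."""
    ip: int              # instruction pointer of the sampled access
    address: int         # data address accessed
    latency_cycles: int  # core clock cycles elapsed during the access
    source: DataSource   # memory level that resolved the access


def mean_latency_by_source(samples):
    """Average latency per memory level that appears in the samples."""
    totals = {}
    for s in samples:
        cycles, count = totals.get(s.source, (0, 0))
        totals[s.source] = (cycles + s.latency_cycles, count + 1)
    return {src: cycles / count for src, (cycles, count) in totals.items()}
```

A per-level average like this is only a starting point; it says nothing yet about which code, data, or hardware resource is responsible, which is the gap the added context is meant to fill.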

3 Adding Context
We can better understand memory references with appropriate context.
Contexts include:
– The code (can get from tools)
– The node hardware topology (can get from tools)
– Calling context (call path) (can get from tools)
– The application (e.g. fluid dynamics) (need help from app)
Other work by Liu & Mellor-Crummey has looked at mapping latency & access patterns to particular variables, call paths, and access patterns.

4 We can already get coarse-grained application context for some codes
Physics data is available in data structures.
Time steps are easy to mark in the code.
Per-process performance is easy to get:
– Turn on counters at the beginning of the run
– Read them periodically
What if we want finer-grained attribution?
– How to tie measurements to data structures?
– How to slice and dice the data?
[Figure labels: Aluminum; FLOP/s per MPI process]

5 Node topology is easy to get, but not shown clearly
PEBS provides metadata for node topology.
We want to highlight connections clearly to show:
– Load distribution
– Bandwidth
– Resource contention
Existing visualization from hwloc (right):
– Does not scale
– Clutters connections between components

6 We have developed a measurement tool for collecting detailed context
Use PEBS sampling for hardware information.
Supplement with application instrumentation for mapping addresses to physical coordinates.
* SMT (Semantic Memory Tree): data structure used to map callbacks to sampled instruction operands

7 Currently the developer has to instrument the application manually
Add calls to get metadata for allocated objects:
1. Label string
2. Start and end addresses
3. Size of each element
4. Number of elements
5. Callback to map address to physical coordinates
Metadata must be provided by the programmer:
– Could easily be implemented in libraries
– Lots of common mesh libraries would be interesting for this
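The five metadata items above can be sketched as a small record plus an attribution helper. This is an illustrative Python sketch, not the tool's instrumentation interface: `DataObject`, `register_array`, and `attribute` are hypothetical names, the end address is assumed to be exclusive, and the callback is assumed to take an element index rather than a raw address.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class DataObject:
    """Metadata for one allocated object (hypothetical layout)."""
    label: str                                   # 1. label string
    start: int                                   # 2. start address
    end: int                                     # 2. end address (exclusive, assumed)
    element_size: int                            # 3. size of each element in bytes
    num_elements: int                            # 4. number of elements
    to_coords: Callable[[int], Tuple[int, ...]]  # 5. element index -> coordinates


def register_array(label, start, element_size, num_elements, to_coords):
    """Build the metadata record for a contiguous array."""
    end = start + element_size * num_elements
    return DataObject(label, start, end, element_size, num_elements, to_coords)


def attribute(obj: DataObject, address: int) -> Optional[tuple]:
    """Map a sampled address to (element index, coordinates), or None if outside."""
    if not (obj.start <= address < obj.end):
        return None
    index = (address - obj.start) // obj.element_size
    return index, obj.to_coords(index)
```

For example, a 4x4 row-major grid of 8-byte values could pass `lambda i: (i % 4, i // 4)` as the callback, so that any byte inside element 5 is attributed to grid cell (1, 1).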

8 Instrumentation
Specify data objects.
Add additional semantic attributes and define attribution function (optional).

9 Semantic Memory Tree (Semantic Memory Range Tree)
[Figure: instrumentation registers the data buffers Velocity, Pressure, Temp, and Density as semantic memory ranges. A binary search tree over their address ranges (root 0x0F to 0xF6; children 0x0F to 0x80 and 0xA2 to 0xF6; leaves 0x0F to 0x20, 0x40 to 0x80, 0xA2 to 0xC2, and 0xE0 to 0xF6) maps sampled addresses to the application domain, so that performance data is recorded in the application domain.]
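The lookup that the figure describes, from a sampled address to the buffer that contains it, can be sketched in Python as follows. This sketch swaps in a sorted list with binary search (`bisect`) in place of an explicit tree; the class name `RangeIndex` is hypothetical, range ends are assumed to be inclusive, and the pairing of the four leaf ranges with the four buffer names in the usage note is an assumption made for illustration.

```python
import bisect


class RangeIndex:
    """Maps an address to the label of the non-overlapping range containing it."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end_inclusive, label) tuples
        self._ranges = sorted(ranges)
        for (_, prev_end, _), (next_start, _, _) in zip(self._ranges, self._ranges[1:]):
            if next_start <= prev_end:
                raise ValueError("address ranges must not overlap")
        self._starts = [start for start, _, _ in self._ranges]

    def lookup(self, address):
        """Return the label whose range contains address, or None."""
        # Find the last range that starts at or before the address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        _, end, label = self._ranges[i]
        return label if address <= end else None
```

With the leaf ranges from the figure, `RangeIndex([(0x0F, 0x20, "Velocity"), (0x40, 0x80, "Pressure"), (0xA2, 0xC2, "Temp"), (0xE0, 0xF6, "Density")])` would resolve each sampled address in logarithmic time, and addresses that fall in the gaps between buffers return `None`.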

10 Lagrangian Hydrodynamics: LULESH
[Figure panels: 2D; 3D; 3D with mapped performance data]

11 We have developed MemAxes, a tool for analyzing on-node memory performance
Measurement component samples memory instructions.
We map latency information onto A) source code and B) node topology.
C) Pie chart shows percent of total latency selected.
D) Parallel coordinates view allows exploration of correlations.
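The aggregations behind views A through C can be approximated with a few lines of Python. This is a sketch under stated assumptions, not MemAxes code: samples are represented as plain dictionaries with hypothetical keys (`"line"`, `"core"`, `"latency"`), and `latency_by_key` and `percent_selected` are names invented for this example.

```python
from collections import defaultdict


def latency_by_key(samples, key):
    """Sum latency per value of one attribute, e.g. per source line or per core."""
    totals = defaultdict(int)
    for s in samples:
        totals[s[key]] += s["latency"]
    return dict(totals)


def percent_selected(samples, predicate):
    """Percent of total latency carried by the samples matching the selection."""
    total = sum(s["latency"] for s in samples)
    if total == 0:
        return 0.0
    selected = sum(s["latency"] for s in samples if predicate(s))
    return 100.0 * selected / total
```

Grouping by `"line"` corresponds to mapping latency onto source code, grouping by `"core"` to mapping it onto the node topology, and `percent_selected` to the pie chart's share of total latency for the current selection.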

12 Linked views clearly show on-node locality problems
Parallel coordinates view shows correlation between array index and core id in LULESH.
Linked node topology view shows data motion for highlighted memory operations.
A contiguous chunk of an array is initially split between threads on four cores.
Using an optimized affinity scheme, we improve locality.
Performance improved by 10%.
[Figure panels: Default thread affinity with poor locality; Optimized thread affinity with good locality]

13 Hyperion Thread/Core Binding
Improved cache usage
44% fewer access cycles
10% total speedup

14 Future work
Back-port perf_events API to production TOSS 2 kernel
– Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
– Affects some Intel thread tools as well
More detailed architecture mapping
– Sandy Bridge LLC ring interconnect information?
– Other node architecture features?
Instrument AMR libraries for proper context attribution
– Study per-patch memory behavior
– Study blocking behavior of solvers
How to query large instruction traces effectively?

