ParaTools ThreadSpotter ParaTools, Inc.


ParaTools ThreadSpotter
John C. Linford et al. (jlinford@paratools.com)
Army HPC User Group Review, 18 May 2017, ARL

ParaTools ThreadSpotter
Shrinking cache memory per core is causing some applications to spend over half their runtime waiting for data to arrive in cache. ThreadSpotter can:
Analyze memory bandwidth and latency, data locality, and thread communication.
Identify specific issues and pinpoint troublesome areas in source code.
Provide guidance towards a resolution.
ThreadSpotter supports mixed-language distributed-memory applications like CREATE-AV HELIOS and KESTREL, and integrates with the TAU Performance System®.
Copyright © ParaTools, Inc.

Cache Capacity/Latency/BW
[figure: access latency in cycles at each level of the memory hierarchy]

Intel i7
[figure: 8 cores x 2 threads; per core a 64 kB D$, a 64 kB I$, and a 256 kB L2; a shared 24 MB L3 reached via crossbar (X-bar); QuickPath Interconnect (QPI) to 4 x DDR-3 memory]

Typical Cache Implementation
[figure: a generic cache; the 64-bit address Addr[63..0] is split (MSB to LSB) into tag, index, and offset fields; the index selects a cache set in SRAM; the stored tags of all ways (associativity = 8-way) are compared against the address tag, and on a hit a mux selects the matching way's 64 B cacheline; hierarchy shown: 64 kB D1 and I1 caches, 256 kB L2, 24 MB L3]

ThreadSpotter Reports

Report Index for 2040 MPI Ranks

Report cover page: no memory bandwidth issues; data locality issues identified.

Global statistics

ThreadSpotter Concepts
Issue: a problem or an opportunity to improve performance, presented with backing statistics and source navigation; online help explains the details.
Loop: an instruction cycle, usually corresponding to a code loop.
Instruction group: a collection of instructions touching related data.

ParaTools ThreadSpotter report: Issues

ParaTools ThreadSpotter report: Loops

ParaTools ThreadSpotter report: Details

Metrics as a function of cache size
Fetch ratio: memory operations that cause a data transfer to/from RAM.
Miss ratio: memory operations that stall due to cache misses.
Fetch utilization: fraction of the data loaded into the cache that is actually used.

ParaTools ThreadSpotter report: Source annotation

Report Vocabulary
Miss ratio: the likelihood that a memory access will miss in a cache.
Miss rate: misses per unit, e.g. per second or per 1000 instructions.
Fetch ratio/rate: the likelihood that a memory access will cause a fetch into the cache (including HW prefetching).
Fetch utilization: the fraction of a cacheline that was used before it got evicted.
Writeback utilization: the fraction of a cacheline written back to memory that contains dirty data.
Communication utilization: the fraction of a communicated cacheline that is ever used.

ParaTools ThreadSpotter report: Help

Example: Temporal Blocking (Gaussian Elimination with Pivoting)

Overview of the forward elimination stage

for i=1 to n-1
  find pivotPos in column i
  if pivotPos ≠ i
    exchange rows(pivotPos, i)
  end if
  for j=i+1 to n
    A(i,j) = A(i,j)/A(i,i)
  end for j
  !$omp parallel do private(i,j)
  for j=i+1 to n+1
    for k=i+1 to n
      A(k,j) = A(k,j) - A(k,i)×A(i,j)
    end for k
  end for j
end for i

First approach
[figure: speedup vs. nThreads, alongside the ParaTools ThreadSpotter report]

What went wrong!
For each prepared pivot, the whole matrix is accessed, and the algorithm requires pivots to be calculated in order. The result is repeated eviction of the matrix's cache lines.
Making things right!
Observation: each column is an accumulation of eliminations using previous columns!
Temporal Blocking Advice says: use each column many times before it gets evicted → arrange the code to make more pivots available!

Blocking GE
[figure: the matrix is split into a C-column block and the rest of the columns; a pivots array records pivot positions; the row exchange turns into a two-element swap before column elimination]

for k=1 to n-1, step C
  BlockEnd = min(k+C-1, n)
  (1) GE on A(k:n, k:BlockEnd) and store the C pivots' positions
  !$omp parallel do private(i,j)
  (2) for each column j after BlockEnd
    for i=k to BlockEnd
      swap using pivots(i)
      elimination i on j
    end for i
  end for each j
end for k

Speed-up w.r.t. sequential time
[figure: speedup vs. nThreads, reaching 6.22x]

Command Line Usage

PTTS Integration with TAU
tau_exec --ptts
Sampling (sample_ts): generates sample files named ptts/sample.#.smp
Issue identification (report_ts): generates report files named ptts/report.#.tsr
Reporting (view-static_ts): generates node_# directories with HTML reports
Each rank produces log files for each step: sample_ts.#.log, report_ts.#.log, and view-static_ts.#.log
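A hedged usage sketch of the workflow above (the launcher, rank count, and application name ./myapp are placeholders; the tau_exec flags and output paths come from this deck):

```shell
# Sample, analyze, and report in one run under TAU:
mpirun -np 16 tau_exec --ptts ./myapp
ls ptts/sample.*.smp    # per-rank sample files
ls ptts/report.*.tsr    # per-rank issue reports
ls node_*/              # per-node HTML reports

# Post-process existing samples without re-running the application,
# e.g. to analyze another cache level:
mpirun -np 16 tau_exec --ptts -ptts-post -ptts-num=16 ./myapp
```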

PTTS Integration with TAU: flags
-ptts : Use PTTS to generate performance data.
-ptts-post : Skip application sampling and post-process existing sample files. Useful for analyzing performance at each cache layer, or predicting performance on other architectures.
-ptts-num=<N> : Indicate the number of MPI ranks if the size of the MPI communicator cannot be automatically detected, e.g. on Cray.
-ptts-sample-flags=<flags> : Additional flags to pass to the sample_ts command. Can also be specified in the TAU_TS_SAMPLE_FLAGS environment variable.
-ptts-report-flags=<flags> : Additional flags to pass to the report_ts command. Can also be specified in the TAU_TS_REPORT_FLAGS environment variable.

Sampler settings: Start & Stop
Controlling when to start and stop sampling.
Start conditions:
  delay: -d <seconds>
  --start-at-function <function name>
  --start-at-address <0x1234123>
  --start-at-ignore <pass count>
Stop conditions:
  duration: -t <seconds>
  --stop-at-function <function name>
  --stop-at-address <0x123123>
  --stop-at-ignore <pass count>
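For example (the function name, timings, and launcher are hypothetical; the sampler flags come from this slide and the TAU_TS_SAMPLE_FLAGS variable from the flag table):

```shell
# Start sampling when solve_step is first entered, stop after 30 s:
export TAU_TS_SAMPLE_FLAGS="--start-at-function solve_step -t 30"
mpirun -np 16 tau_exec --ptts ./myapp

# Or skip the first 60 s of initialization:
export TAU_TS_SAMPLE_FLAGS="-d 60"
mpirun -np 16 tau_exec --ptts ./myapp
```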

Report settings: Issue selection
depth: how long a call stack to consider while differentiating instructions.
  --depth 0: just consider the PC
  --depth 1: consider the PC and the site which called this function
percentage cutoff: suppress uninteresting issues.
  --percentage 3 (% of fetches)

Report settings: Source/debug
source directory (-s directory): look for source in other places. Useful if the directories are not recorded in the debug information.
binary (-b path-to-actual-binary): look for binaries containing debug information in other places. Useful if the binary has moved since sampling.
debug directory (-D directory): for debug information stored external to the ELF file.
debug level (--debug-level 0-3): varying degree of information.