
1 John Mellor-Crummey, Robert Fowler, Nathan Tallent, Gabriel Marin
Department of Computer Science, Rice University
Los Alamos Computer Science Institute
HPCToolkit: Multi-platform Tools for Performance Analysis
http://www.hipersoft.rice.edu/hpctoolkit/

2 The Big Picture
Long-term: compiler and architecture research requires detailed performance understanding
- identify performance bottlenecks in complex applications
- understand the mismatch between application needs and architecture capabilities
- automate strategies for performance improvement
Short-term result: programmer-accessible tools for understanding application performance

3 Performance Analysis and Tuning
Increasingly necessary
- gap between typical and peak performance is growing
Increasingly hard
- complex architectures are harder to program effectively
  - deeply pipelined microprocessors (VLIW or superscalar)
  - complex memory hierarchy (non-blocking, multi-level caches)
- large-scale scientific applications pose challenges for tools

4 LACSI HPCToolkit
Support large, multi-lingual applications
- a mix of Fortran, C, and C++
- hundreds of thousands of lines, many procedures
- external libraries
Eliminate manual labor from the run-analyze-tune cycle
- use optimized application binaries directly
  - no manual instrumentation, build process changes, or recompilation
Platform, language, and compiler independence
- emphasis on LANL ASC platforms (Origin, AlphaServer, Opteron)
- multiple data sources → cross-platform comparisons
Scalable data collection
Effective presentation of analysis results
- intuitive, top-down user interface
  - hierarchical program structure with loop-level metrics

5 HPCToolkit System Overview
[Figure: workflow diagram. Labels: application source, compilation, linking, binary object code, profile execution, performance profile, binary analysis, program structure, interpret profile, source correlation, hyperlinked database, hpcviewer]

6 HPCToolkit System Overview
- launch unmodified, optimized application binaries
- collect statistical profiles of events of interest
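
The idea of a statistical profile can be sketched in a few lines. This is only an illustration, not HPCToolkit's collector: the event stream, the addresses, and the `sample_profile` helper are all hypothetical, and a real profiler would typically take its samples from timer or hardware counter overflow interrupts rather than from a recorded list.

```python
from collections import Counter


def sample_profile(event_stream, period):
    """Record one sample at the program counter (PC) of every `period`-th event."""
    histogram = Counter()
    for count, pc in enumerate(event_stream, start=1):
        if count % period == 0:
            histogram[pc] += 1
    return histogram


# Hypothetical stream: each entry is the PC at which one event (say, a cycle) occurred.
stream = [0x400] * 700 + [0x404] * 200 + [0x408] * 100

period = 10
profile = sample_profile(stream, period)

# Scaling each sample count by the period estimates the true event count per PC.
estimated = {pc: samples * period for pc, samples in profile.items()}
```

Because only one event in every `period` is recorded, the overhead stays low while the histogram still approximates where the events occur.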

7 HPCToolkit System Overview
- decode instructions and combine with profile data
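
One plausible reading of this step is that per-address sample counts are mapped back to source lines. The sketch below assumes a line table of (start address, source location) pairs such as debug information might provide; the table contents, the file name, and the `attribute_to_lines` helper are hypothetical and not taken from HPCToolkit.

```python
import bisect
from collections import defaultdict


def attribute_to_lines(samples, line_table):
    """Sum sample counts per source line.

    samples:    {pc: count}
    line_table: list of (start_address, (file, line)), sorted by start_address;
                each entry covers addresses up to the next entry's start.
    """
    starts = [start for start, _ in line_table]
    per_line = defaultdict(int)
    for pc, count in samples.items():
        idx = bisect.bisect_right(starts, pc) - 1
        if idx < 0:
            continue  # address lies before the first table entry; skip it
        per_line[line_table[idx][1]] += count
    return dict(per_line)


# Hypothetical line table and sample counts.
line_table = [
    (0x400, ("solver.f", 10)),
    (0x410, ("solver.f", 11)),
    (0x420, ("solver.f", 14)),
]
samples = {0x400: 70, 0x404: 20, 0x418: 10, 0x3F0: 5}

by_line = attribute_to_lines(samples, line_table)
```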

8 HPCToolkit System Overview
- extract loop nesting information from executables
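
The slides do not say how loop nesting is recovered. A standard textbook approach, shown here purely as an assumption, is to build a control-flow graph from the decoded instructions, find back edges using dominators, and collect the natural loop of each back edge. The graph and block names below are hypothetical.

```python
def dominators(cfg, entry):
    """Iterative dominator sets; cfg maps every node to its successor list."""
    nodes = set(cfg)
    preds = {n: [] for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].append(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds


def natural_loops(cfg, entry):
    """Map each loop header to the set of nodes in its natural loop."""
    dom, preds = dominators(cfg, entry)
    loops = {}
    for src, succs in cfg.items():
        for header in succs:
            if header in dom[src]:  # back edge: the target dominates the source
                body = loops.setdefault(header, {header})
                stack = [src]
                while stack:
                    node = stack.pop()
                    if node not in body:
                        body.add(node)
                        stack.extend(preds[node])
    return loops


def nesting_depth(loops):
    """Nesting depth of a node = number of loops whose body contains it."""
    nodes = set().union(*loops.values()) if loops else set()
    return {n: sum(n in body for body in loops.values()) for n in nodes}


# Hypothetical CFG for a doubly nested loop.
cfg = {
    "entry": ["outer"],
    "outer": ["inner", "exit"],
    "inner": ["inner_body", "outer_latch"],
    "inner_body": ["inner"],
    "outer_latch": ["outer"],
    "exit": [],
}
loops = natural_loops(cfg, "entry")
depth = nesting_depth(loops)
```

Working from the control-flow graph rather than the source is what lets this kind of analysis run on optimized binaries without recompilation.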

9 HPCToolkit System Overview
- synthesize new metrics by combining metrics
- relate metrics, structure, and program source
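
A derived metric is simply an expression over measured ones, evaluated per program scope. The sketch below assumes a table of raw counts keyed by scope; the scope label, the metric names, and the `derive_metric` helper are hypothetical.

```python
import math


def derive_metric(table, name, formula):
    """Add metric `name` to every scope by applying `formula` to its raw metrics."""
    for metrics in table.values():
        try:
            metrics[name] = formula(metrics)
        except ZeroDivisionError:
            metrics[name] = math.nan  # undefined where the denominator is zero
    return table


# Hypothetical raw counts for two scopes.
table = {
    "solver.f: loop at line 42": {"cycles": 9000, "instructions": 3000, "fp_ops": 1500},
    "io.c: write_out": {"cycles": 500, "instructions": 0, "fp_ops": 0},
}

derive_metric(table, "cycles_per_instruction", lambda m: m["cycles"] / m["instructions"])
derive_metric(table, "fp_ops_per_cycle", lambda m: m["fp_ops"] / m["cycles"])
```

Ratios like these are often more telling than raw counts: a loop with many cycles but a low cycles-per-instruction value is busy, while one with a high value is mostly stalled.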

10 HPCToolkit System Overview
- support top-down analysis with interactive viewer
- analyze results anytime, anywhere
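
Top-down analysis relies on each scope carrying an inclusive cost, so that a user can start at the whole program and descend into the most expensive file, procedure, and loop. A minimal sketch follows, assuming a simple scope tree; the `Scope` class, the names, and the costs are hypothetical and do not reflect hpcviewer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    name: str
    exclusive: float = 0.0  # cost attributed directly to this scope
    children: list = field(default_factory=list)

    def inclusive(self):
        """Own cost plus the cost of everything nested inside."""
        return self.exclusive + sum(child.inclusive() for child in self.children)


def top_down(scope, depth=0):
    """Flatten the tree into (depth, name, inclusive cost), costliest child first."""
    rows = [(depth, scope.name, scope.inclusive())]
    for child in sorted(scope.children, key=lambda c: c.inclusive(), reverse=True):
        rows.extend(top_down(child, depth + 1))
    return rows


# Hypothetical program structure: file > procedure > loop.
program = Scope("program", children=[
    Scope("io.c", children=[Scope("write_out", exclusive=10.0)]),
    Scope("solver.f", children=[
        Scope("solve", exclusive=5.0, children=[
            Scope("loop at line 70", exclusive=25.0),
            Scope("loop at line 40", exclusive=60.0),
        ]),
    ]),
])

rows = top_down(program)
for depth, name, cost in rows:
    print(f"{'  ' * depth}{name}: {cost}")
```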

11 HPCViewer Screenshot
[Figure: hpcviewer screenshot with panes labeled Metrics, Navigation, and Annotated Source View]

12 Impact on LANL Code Teams
HPCToolkit deployed on Origin
- improved SAGE by 2x on one example (see next slide)
First performance workshop (Feb 03)
- Feedback: needed on Q, smaller DB on large codes
- Improvements: sophisticated support for Alpha/Tru64 platform, new Java browser using compact database
Second performance workshop (July 03)
- Feedback: ready to use, binary analysis too slow on large codes
- Improvement: sped up binary analysis on large codes by 30x
HPCToolkit deployed on secure machines (July 03)
- used to evaluate FLAG for ASCI burn code review (Aug 03)
Ongoing interactions
- Feedback: better support for shared libraries and Opteron
- Improvement: new support for shared libraries installed on Q
- Ongoing work: LANL/Rice collaboration for Opteron support

13 SAGE Solver Performance Improvement

14 Future
Collect and present dynamic context
- what path gets us to expensive computations
- accurate call-graph profiling of unmodified binaries
- analysis and presentation of dynamic context to explain performance
  - solver is slow only when called on non-preconditioned matrices
  - MPI wait cost is incurred in the backsolve
Statistical clustering
- effective analysis of large collections of processes
Performance diagnosis
- why rather than what
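
To illustrate why dynamic context matters, the sketch below aggregates sampled call stacks into a calling context tree, a common structure for this purpose. It is an assumption about how such data could be organized, not a description of the planned HPCToolkit design; the stacks and function names are hypothetical.

```python
def build_cct(stack_samples):
    """Aggregate call stacks (outermost frame first) into a calling context tree."""
    root = {"count": 0, "children": {}}
    for stack in stack_samples:
        node = root
        node["count"] += 1
        for frame in stack:
            node = node["children"].setdefault(frame, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def cost_of(cct, path):
    """Number of samples whose stack starts with the given call path."""
    node = cct
    for frame in path:
        node = node["children"][frame]
    return node["count"]


# Hypothetical samples: `solve` is reached along two different call paths.
samples = [("main", "setup", "solve")] * 2 + [("main", "iterate", "solve")] * 8
cct = build_cct(samples)

via_setup = cost_of(cct, ("main", "setup", "solve"))
via_iterate = cost_of(cct, ("main", "iterate", "solve"))
```

A flat profile would report ten samples in `solve` with no further detail; keeping the context shows that most of them arrive through `iterate`, which is the kind of distinction the solver and MPI examples on this slide call for.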

