Slide 1: Benchmarking Working Group Session Agenda
1:00-1:15  David Koester    What Makes HPC Applications Challenging?
1:15-1:30  Piotr Luszczek   HPCchallenge Challenges
1:30-1:45  Fred Tracy       Algorithm Comparisons of Application Benchmarks
1:45-2:00  Henry Newman     I/O Challenges
2:00-2:15  Phil Colella     The Seven Dwarfs
2:15-2:30  Glenn Luecke     Run-Time Error Detection Benchmark
2:30-3:00  Break
3:00-3:15  Bill Mann        SSCA #1 Draft Specification
3:15-3:30  Theresa Meuse    SSCA #6 Draft Specification
3:30-??    Discussions — User Needs; HPCS Vendor Needs for the MS4 Review; HPCS Vendor Needs for the MS5 Review; HPCS Productivity Team Working Groups

Slide 2: What Makes HPC Applications Challenging?
David Koester, Ph.D.
HPCS Productivity Team Meeting, Marina del Rey, CA — January 2005
MITRE / ISI / MIT Lincoln Laboratory

This work is sponsored by the Department of Defense under Army Contract W15P7T-05-C-D001. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide 3: Outline
- HPCS Benchmark Spectrum
- What Makes HPC Applications Challenging?
  - Memory access patterns/locality
  - Processor characteristics
  - Concurrency
  - I/O characteristics
  - What new challenges will arise from Petascale/s+ applications?
- Bottleneckology
  - Amdahl's Law
  - Example: random-stride memory access
- Summary

Slide 4: HPCS Benchmark Spectrum

Slide 5: HPCS Benchmark Spectrum — What Makes HPC Applications Challenging?
- Full applications may be challenging due to:
  - Killer Kernels
  - Global data layouts
  - Input/Output
- Killer Kernels are challenging because of many things that link directly to architecture
- Identify bottlenecks by mapping applications to architectures

Slide 6: What Makes HPC Applications Challenging?
- Memory access patterns/locality
  - Spatial and temporal locality
  - Indirect addressing
  - Data dependencies
- Processor characteristics
  - Processor throughput (instructions per cycle)
    - Low arithmetic density
    - Floating point versus integer
  - Special features
    - GF(2) math
    - Popcount
    - Integer division
- Concurrency
  - Ubiquitous for Petascale/s
  - Load balance
- I/O characteristics
  - Bandwidth
  - Latency
  - File access patterns
  - File generation rates
(Slide sidebar labels: Killer Kernels, Global Data Layouts, Input/Output)
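
To make the indirect-addressing point concrete, here is a minimal C sketch (mine, not from the deck) contrasting a unit-stride reduction with a gather through an index array. When the index array is effectively random, spatial locality disappears and hardware prefetching stops helping, which is exactly the behavior the memory-access bullets describe.

    #include <stddef.h>

    /* Unit-stride access: consecutive loads, prefetch- and cache-friendly. */
    double sum_stride1(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Indirect addressing (gather): each load address depends on idx[i],
     * so spatial locality is lost and hardware prefetchers are defeated
     * when idx[] is effectively random. */
    double sum_gather(const double *a, const size_t *idx, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[idx[i]];
        return s;
    }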

Slide 7: Cray "Parallel Performance Killer" Kernels

Kernel                               | Performance characteristic
RandomAccess                         | High demand on remote memory; no locality
3D FFT                               | Non-unit strides; high bandwidth demand
Sparse matrix-vector multiply        | Irregular, unpredictable locality
Adaptive mesh refinement             | Dynamic data distribution; dynamic parallelism
Multi-frontal method                 | Multiple levels of parallelism
Sparse incomplete factorization      | Amdahl's Law bottlenecks
Preconditioned domain decomposition  | Frequent large messages
Triangular solver                    | Frequent small messages; poor ratio of computation to communication
Branch-and-bound algorithm           | Frequent broadcast synchronization
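
As one concrete instance of a kernel from this table, a sparse matrix-vector multiply in the common compressed sparse row (CSR) layout looks roughly like the sketch below. The CSR layout and the function interface are my assumptions, not something specified in the slides; the data-dependent gather x[col[k]] is what produces the "irregular, unpredictable locality" noted in the table.

    #include <stddef.h>

    /* Sparse matrix-vector multiply, y = A*x, with A stored in compressed
     * sparse row (CSR) form. The gather x[col[k]] is the "killer" part:
     * column indices are data-dependent, so locality in x is irregular
     * and hard for the memory system to predict. */
    void spmv_csr(size_t nrows,
                  const size_t *rowptr,   /* length nrows+1            */
                  const size_t *col,      /* length rowptr[nrows]      */
                  const double *val,      /* length rowptr[nrows]      */
                  const double *x,
                  double *y) {
        for (size_t i = 0; i < nrows; i++) {
            double s = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
    }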

Slide 8: Killer Kernels
- Phil Colella — The Seven Dwarfs

Slide 9: Memory Access Patterns/Locality
- Mission Partner Applications: how do mission partner applications relate to the HPCS spatial/temporal view of memory?
  - Kernels?
  - Full applications?
- HPCS Challenge Points
- HPCchallenge Benchmarks

Slide 10: Processor Characteristics — Special Features
- Comparison of similar-speed MIPS processors with and without:
  - GF(2) math
  - Popcount
- Similar or better performance reported using Alpha processors (Jack Collins, NCI-FCRF)
- Codes:
  - Cray-supplied library
  - The Portable Cray Bioinformatics Library by ARSC
- References
- Algorithmic speedup of 120x
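
A hedged illustration of why popcount (and GF(2)-style bit arithmetic) matters for this kind of bioinformatics workload: the sketch below scores two bit-packed sequences using a hardware population count. It is only an illustrative example, not the Cray or ARSC library code; the function name is hypothetical, and __builtin_popcountll is a GCC/Clang builtin.

    #include <stdint.h>
    #include <stddef.h>

    /* Count matching bits between two bit-packed sequences, e.g. for a
     * bioinformatics-style similarity score. a[i] ^ b[i] is addition in
     * GF(2); hardware popcount reduces each word to one instruction,
     * whereas processors without it need a multi-instruction software
     * loop per word (the "without" case compared on the slide). */
    uint64_t matching_bits(const uint64_t *a, const uint64_t *b, size_t nwords) {
        uint64_t matches = 0;
        for (size_t i = 0; i < nwords; i++)
            matches += (uint64_t)__builtin_popcountll(~(a[i] ^ b[i]));
        return matches;
    }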

Slide 11: Concurrency
- Insert cluttered VAMPIR plot here

Slide 12: I/O — Relative Data Latency ‡
- Note: 11 orders of magnitude relative differences!
- ‡ Henry Newman (Instrumental)

Slide 13: I/O — Relative Data Bandwidth per CPU ‡
- Note: 5 orders of magnitude relative differences!
- ‡ Henry Newman (Instrumental)

Slide 14: Strawman HPCS I/O Goals/Challenges
- 1 trillion files in a single file system
  - 32K file creates per second
- 10K metadata operations per second
  - Needed for checkpoint/restart files
- Streaming I/O at 30 GB/sec full duplex
  - Needed for data capture
- Support for 30K nodes
  - Future file systems need low-latency communication
- An envelope on HPCS Mission Partner requirements
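
A quick back-of-the-envelope check (mine, not on the slide) of how the first two goals interact: even at the stated create rate, populating a trillion-file file system takes roughly a year of continuous metadata activity.

    #include <stdio.h>

    /* Back-of-the-envelope check of the strawman goals: how long does it
     * take to populate a trillion-file file system at the stated create
     * rate? The two figures come from the slide; reading 32K as 32,768
     * is an assumption (32,000 gives ~362 days instead). */
    int main(void) {
        double files   = 1e12;      /* 1 trillion files            */
        double rate    = 32768.0;   /* 32K file creates per second */
        double seconds = files / rate;
        printf("%.1f days of continuous creates\n", seconds / 86400.0);
        /* ~353 days: the create rate alone already makes this a challenge. */
        return 0;
    }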

Slide 15: HPCS Benchmark Spectrum — Future and Emerging Applications
- Identifying HPCS Mission Partner efforts:
  - 10-20K processor — Teraflop/s-scale applications
  - 20-120K processor — Teraflop/s-scale applications
  - Petascale/s applications
  - Applications beyond Petascale/s
- LACSI Workshop — The Path to Extreme Supercomputing, 12 October 2004
- What new challenges will arise from Petascale/s+ applications?

Slide 16: Outline
- HPCS Benchmark Spectrum
- What Makes HPC Applications Challenging?
  - Memory access patterns/locality
  - Processor characteristics
  - Parallelism
  - I/O characteristics
  - What new challenges will arise from Petascale/s+ applications?
- Bottleneckology
  - Amdahl's Law
  - Example: random-stride memory access
- Summary

Slide 17: Bottleneckology
- Where is performance lost when an application is run on an architecture?
- When does it make sense to invest in architecture to improve application performance?
- System analysis driven by an extended Amdahl's Law
  - Amdahl's Law is not just about parallel and sequential parts of applications!
- References:
  - Jack Worlton, "Project Bottleneck: A Proposed Toolkit for Evaluating Newly-Announced High Performance Computers", Worlton and Associates, Los Alamos, NM, Technical Report No. 13, January 1988
  - Montek Singh, "Lecture Notes — Computer Architecture and Implementation: COMP 206", Dept. of Computer Science, Univ. of North Carolina at Chapel Hill, Aug 30, 2004, fall-04/lectures/lecture-2.ppt

Slide 18: Lecture Notes — Computer Architecture and Implementation (5) ‡
- ‡ Montek Singh (UNC)

Slide 19: Lecture Notes — Computer Architecture and Implementation (6) ‡
- ‡ Montek Singh (UNC)

Slide 20: Lecture Notes — Computer Architecture and Implementation (7) ‡
- Also works for Rate = Bandwidth!
- ‡ Montek Singh (UNC)
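
The reproduced lecture notes are not in the transcript, so as a stand-in here is a minimal sketch of the rate form of Amdahl's Law this slide alludes to, assuming the usual weighted-harmonic-mean formulation: if a fraction f_i of the work proceeds at rate r_i, the effective overall rate is 1 / sum_i(f_i / r_i). The function name and interface are my own.

    #include <stddef.h>

    /* Rate form of Amdahl's Law (weighted harmonic mean): if fraction
     * frac[i] of the work proceeds at rate[i] (e.g. MB/s), the effective
     * overall rate is 1 / sum(frac[i] / rate[i]). A small slow fraction
     * drags the whole workload toward the slow rate. */
    double effective_rate(const double *frac, const double *rate, size_t n) {
        double inv = 0.0;
        for (size_t i = 0; i < n; i++)
            inv += frac[i] / rate[i];
        return 1.0 / inv;
    }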

Slide 21: Lecture Notes — Computer Architecture and Implementation (8) ‡
- ‡ Montek Singh (UNC)

Slide 22: Bottleneck Example (1)
- Combine stride-1 and random-stride memory access:
  - 25% random-stride access
  - 33% random-stride access
- Memory bandwidth performance is dominated by the random-stride memory access
- SDSC MAPS on an IBM SP-3
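
For context, a MAPS-style measurement mixes the two access patterns within a single sweep. The sketch below is only an illustration under my own assumptions, not the SDSC MAPS source; p is the random-stride fraction (0.25 or 0.33 above), and the loop would be timed externally to observe the measured bandwidth collapsing toward the random-stride rate as p grows.

    #include <stddef.h>

    /* Touch a fraction p of the elements through a pre-shuffled index
     * array (random stride) and the remainder with unit stride; the sum
     * keeps the compiler from eliminating the loads. */
    double mixed_access(const double *a, const size_t *shuffled,
                        size_t n, double p) {
        size_t nrand = (size_t)(p * (double)n);
        double s = 0.0;
        for (size_t i = 0; i < nrand; i++)   /* random-stride portion */
            s += a[shuffled[i]];
        for (size_t i = nrand; i < n; i++)   /* stride-1 portion */
            s += a[i];
        return s;
    }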

Slide 23: Bottleneck Example (2)
- Combine stride-1 and random-stride memory access:
  - 25% random-stride access
  - 33% random-stride access
- Memory bandwidth performance is dominated by the random-stride memory access
- SDSC MAPS on a COMPAQ Alphaserver
- Amdahl's Law: 7000 / (0.75 + 7 x 0.25) = 2800 MB/s
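
Plugging the slide's numbers back in reproduces the quoted figure, under the assumption that stride-1 bandwidth is about 7000 MB/s and random-stride bandwidth is about 7x lower (roughly 1000 MB/s); with a 25% random-stride fraction the weighted harmonic mean gives 2800 MB/s.

    #include <stdio.h>

    /* Reproducing the slide's Amdahl's Law figure with assumed rates:
     * 75% stride-1 at ~7000 MB/s, 25% random stride at ~1000 MB/s. */
    int main(void) {
        double bw = 1.0 / (0.75 / 7000.0 + 0.25 / 1000.0);
        printf("effective bandwidth = %.0f MB/s\n", bw);   /* 2800 MB/s */
        return 0;
    }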

Slide 24: Bottleneck Example (2)
- Combine stride-1 and random-stride memory access:
  - 25% random-stride access
  - 33% random-stride access
- Memory bandwidth performance is dominated by the random-stride memory access
- SDSC MAPS on a COMPAQ Alphaserver
- Amdahl's Law: 7000 / (0.75 + 7 x 0.25) = 2800 MB/s
- Some HPCS Mission Partner applications:
  - Extensive random-stride memory access
  - Some random-stride memory access
- However, even a small amount of random memory access can cause significant bottlenecks!

Slide 25: Outline
- HPCS Benchmark Spectrum
- What Makes HPC Applications Challenging?
  - Memory access patterns/locality
  - Processor characteristics
  - Parallelism
  - I/O characteristics
  - What new challenges will arise from Petascale/s+ applications?
- Bottleneckology
  - Amdahl's Law
  - Example: random-stride memory access
- Summary

Slide 26: Summary (1)
What makes applications challenging:
- Memory access patterns/locality
  - Spatial and temporal locality
  - Indirect addressing
  - Data dependencies
- Processor characteristics
  - Processor throughput (instructions per cycle)
    - Low arithmetic density
    - Floating point versus integer
  - Special features
    - GF(2) math
    - Popcount
    - Integer division
- Parallelism
  - Ubiquitous for Petascale/s
  - Load balance
- I/O characteristics
  - Bandwidth
  - Latency
  - File access patterns
  - File generation rates

- Expand this list as required
- Work toward consensus with:
  - HPCS Mission Partners
  - HPCS Vendors
- Understand bottlenecks
  - Characterize applications
  - Characterize architectures

Slide 27: HPCS Benchmark Spectrum — What Makes HPC Applications Challenging?
- Full applications may be challenging due to:
  - Killer Kernels
  - Global data layouts
  - Input/Output
- Killer Kernels are challenging because of many things that link directly to architecture
- Identify bottlenecks by mapping applications to architectures
- Impress upon the HPCS community the need to identify what makes an application challenging when using an existing Mission Partner application for a systems analysis in the MS4 review