BG/Q vs BG/P—Applications Perspective from Early Science Program. Timothy J. Williams, Argonne Leadership Computing Facility. 2013 MiraCon Workshop, Monday 3/4/2013.

Presentation transcript:

BG/Q vs BG/P—Applications Perspective from Early Science Program
Timothy J. Williams
Argonne Leadership Computing Facility
2013 MiraCon Workshop, Monday 3/4/2013, Session: 3:45-4:30pm

Slide 2: BG/P applications should run, unchanged, on BG/Q — faster

Slide 3: First in Mira Queue: Early Science Program
• 16 projects
  – Large target allocations
  – Postdoc
• Proposed runs between Mira acceptance and start of production
• 2 billion core-hours to burn in a few months

Slide 4: 16 ESP Projects
• Science Areas: Astrophysics, Biology, CFD/Aerodynamics, Chemistry, Climate, Combustion, Cosmology, Energy, Fusion Plasma, Geophysics, Materials, Nuclear Structure
• Algorithms/Methods: Structured Grids, Unstructured Grids, FFT, Dense Linear Algebra, Sparse Linear Algebra, Particles/N-Body, Monte Carlo
• 7 National Lab PIs, 9 University PIs

Slide 5: How Much Effort to “Port” to BG/Q?
• On the next two slides, efforts are characterized as S = small, M = medium, L = large:
  – S: zero to a few days of effort, modifications to 0%-3% of existing lines of code
  – M: a few weeks of effort, modifications to 3%-10% of existing lines of code
  – L: a few months of effort, modifications beyond 10% of existing lines of code
• Ranking based on estimates by the people who actually did the work

Slide 6: How Much Effort?
PI/affiliation     | Code(s)              | Magnitude of Changes | Nature of Changes
Balaji/GFDL        | HIRAM                | L                    | Improve OpenMP implementation, reformulate divergence-damping
Curtiss/ANL        | QMCPACK              | M                    | S to port, L to use QPX in key kernels; plan: nested OpenMP
Frouzakis/ETH      | Nek5000              | S                    | Optimized small matrix-matrix multiply using QPX (see sketch below)
Gordon/Iowa State  | GAMESS               | M                    | 64-bit addressing, thread integral kernels with OpenMP
Habib/ANL, UC      | HACC                 | M                    | Short-range-force only: tree code
Harrison/ORNL      | MADNESS              | S                    | Threading runtime tuning; kernel tuning to use QPX
Jansen/U Colorado  | PHASTA               | S                    | Unchanged MPI-only performs well; OpenMP threading in testing
Jordan/USC         | AWP-ODC, SORD        | S, M                 | None; threading
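For concreteness, below is a minimal sketch of the kind of QPX small matrix-matrix multiply mentioned in the Nek5000 row. It is illustrative only, not the actual Nek5000 kernel: the 4xK by KxN shape, column-major layout, and the function name mxm4 are assumptions. It relies on the vector4double type and vec_* built-ins provided by the IBM XL compilers on BG/Q (compile with -qarch=qp), guarded by the __bgq__ macro.

    /* Illustrative sketch only -- not the actual Nek5000 kernel. Computes
     * C(4xN) = A(4xK) * B(KxN), column-major, assuming 32-byte-aligned A and C
     * so that each 4-element column maps onto one QPX register. */
    void mxm4(const double *a, const double *b, double *c, int K, int N)
    {
    #ifdef __bgq__
        for (int j = 0; j < N; ++j) {
            vector4double cj = vec_splats(0.0);                   /* accumulator: column j of C */
            for (int k = 0; k < K; ++k) {
                vector4double ak = vec_ld(0, (double *)&a[4*k]);  /* column k of A (aligned load) */
                vector4double bk = vec_splats(b[k + K*j]);        /* broadcast scalar B(k,j) */
                cj = vec_madd(ak, bk, cj);                        /* fused multiply-add: cj += ak*B(k,j) */
            }
            vec_st(cj, 0, &c[4*j]);                               /* store column j of C */
        }
    #else
        /* Portable scalar fallback for non-BG/Q builds */
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < 4; ++i) {
                double s = 0.0;
                for (int k = 0; k < K; ++k)
                    s += a[i + 4*k] * b[k + K*j];
                c[i + 4*j] = s;
            }
    #endif
    }

The point of this style of rewrite is that the inner k loop becomes a stream of 4-wide double-precision fused multiply-adds, which is what the QPX unit is built for.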

Slide 7: How Much Effort? (continued)
PI/affiliation     | Code(s)              | Magnitude of Changes | Nature of Changes
Khokhlov/UC        | HSCD                 | S                    | Tune OpenMP parameters, link optimized math libs
Lamb/UC            | FLASH/RTFlame        | S                    | OpenMP threading
Mackenzie/Fermilab | MILC, Chroma, CPS    | L                    | Full threading, QPX intrinsics/assembler, kernel on SPI comm.
Moser/UTexas       | PSDNS                | S                    | Compile dependency libs, add OpenMP directives for threading (see sketch below)
Pieper/ANL         | GFMC                 | S                    | Tune number of threads & ranks
Roux/UC            | NAMD, Charm++        | L                    | Threads, PAMI implementation of Charm++
Tang/Princeton     | GTC                  | S                    | Improve OpenMP implementation
Voth/UC, ANL       | NAMD, LAMMPS, RAPTOR | M                    | OpenMP threads & serial optimizations in RAPTOR/LAMMPS
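As a reference point for the “add OpenMP directives for threading” and “tune number of threads & ranks” entries above, here is a minimal hybrid MPI + OpenMP sketch. It is generic, not taken from any of these codes; the loop body, problem size, and the example runjob settings in the comment are assumptions.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative only). The rank/thread
     * split is left to the job launcher, e.g. on BG/Q something like
     *   runjob --ranks-per-node 16 --envs OMP_NUM_THREADS=4 ...
     * so that ranks-per-node x threads-per-rank covers the node's 64 hardware threads. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* FUNNELED is enough when only the master thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;       /* per-rank work size (made up) */
        double local = 0.0;

        /* The per-rank work loop, threaded with a single OpenMP directive */
        #pragma omp parallel for reduction(+ : local)
        for (int i = 0; i < n; ++i)
            local += 1.0 / (1.0 + (double)i);

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f with %d threads per rank\n", global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

For several of the S-rated ports above, the work was essentially this pattern: add the directives, then sweep the ranks-per-node versus threads-per-rank balance to find the best fit for BG/Q's 16 cores and 64 hardware threads per node.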

Slide 8: Areas of Effort
• Threads
• Communications
  – One-sided
  – Beneath MPI
• Kernel optimizations
  – QPX
  – Code restructuring
• Parallel I/O (see sketch below)
• Algorithms targeting Blue Gene architecture
• BG/Q tuned libraries
  – Linear algebra
  – Math functions
  – FFTs
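To illustrate the “Parallel I/O” item, here is a small MPI-IO sketch of one collective write to a shared file. It is generic rather than drawn from any ESP code; the file name, block size, and data values are made up for the example.

    /* Illustrative MPI-IO sketch: every rank writes one contiguous block to a
     * single shared file using a collective call. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1024;                  /* doubles per rank (made up) */
        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; ++i)
            buf[i] = rank + 0.001 * i;           /* some per-rank data */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its block at a disjoint offset */
        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Using the collective form (MPI_File_write_at_all rather than independent writes) gives the MPI-IO layer a chance to aggregate requests before they reach the I/O nodes, which is typically where parallel-I/O tuning on Blue Gene systems pays off.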