Applications for K42: Initial Brainstorming
Paul Hargrove and Kathy Yelick, with input from Lenny Oliker, Parry Husbands, and Mike Welcome

Criteria for Selecting Apps

- Should be OS/Runtime "intensive"
- Would highlight features/benefits of K42
  - I/O performance is important
  - Uses multiple threads per node (show off fast threads)
  - Uses small/asynchronous/active messages
- Pragmatics
  - Should be easily ported to K42
  - Part of an existing/funded effort
  - Has a collaborator who:
    - Understands the application in detail (e.g., an author)
    - Is interested in OS/Runtime issues

MADCAP: I/O Bound

- For the astrophysicists in the audience:
  - Microwave Anisotropy Dataset Computational Analysis Package
  - Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB) radiation
  - Anisotropies in the CMB contain the early history of the Universe
  - Calculates the maximum-likelihood two-point angular correlation function
  - Recasts the problem as dense linear algebra: ScaLAPACK
    - Steps include: mat-mat, matrix inverse, mat-vec, Cholesky decomposition, data redistribution
- Portability:
  - Depends on ScaLAPACK, which is portable
  - Has been tuned and run on vector, ccNUMA, and cluster systems
- Developed at NERSC/LBNL by Julian Borrill
- Part of the application evaluation suite led by Leonid Oliker at LBNL

[Figure: Temperature anisotropies in the CMB as measured by Boomerang]
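To make the dense-algebra steps concrete, here is a minimal serial sketch in C of one likelihood step, using LAPACKE and CBLAS as stand-ins for the distributed ScaLAPACK routines the real code calls; the function name, matrix arguments, and the exact sequence of operations are illustrative assumptions, not MADCAP's actual source.

    #include <lapacke.h>
    #include <cblas.h>

    /* One illustrative step of a MADCAP-style likelihood evaluation:
     * factor the pixel-pixel covariance, invert it, and apply it to a
     * derivative matrix with a BLAS3 multiply. */
    int likelihood_step(int n, double *cov, double *deriv, double *tmp)
    {
        /* Cholesky-factor the covariance matrix in place. */
        lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, cov, n);
        if (info != 0) return (int)info;

        /* Form the matrix inverse from the Cholesky factor
         * (only the lower triangle is filled). */
        info = LAPACKE_dpotri(LAPACK_ROW_MAJOR, 'L', n, cov, n);
        if (info != 0) return (int)info;

        /* BLAS3 symmetric matrix-matrix multiply: tmp = cov^{-1} * deriv,
         * the kind of operation that dominates MADCAP's flop count. */
        cblas_dsymm(CblasRowMajor, CblasLeft, CblasLower, n, n,
                    1.0, cov, n, deriv, n, 0.0, tmp, n);
        return 0;
    }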

MADCAP: Performance

- Computation is dominated by BLAS3: efficiency should be very high
- But all systems sustain a relatively low percentage of peak
- Reason: I/O is a major challenge
  - Code only partially ported due to its requirement of a global file system
  - I/O performance is a limiting factor in the results below
  - Further work is required to: reduce I/O, remove system calls, and remove the global file system requirement
- Detailed analysis presented at HiPC 2004 by Carter, Borrill, and Oliker

         Power 3            Power 4            ES                 X1
    P    Gflops/P  %peak    Gflops/P  %peak    Gflops/P  %peak    Gflops/P  %peak
    …    …         …        1.5       29%      4.1       32%      2.2       27%
    …    …         …        0.81      16%      1.9       23%      2.0       16%
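Since the out-of-core I/O pattern is what limits performance, a minimal sketch of the per-process slab write/read it implies is shown below; the path template, function names, and the idea of pointing the files at node-local scratch (one way to relax the global-file-system requirement) are assumptions for illustration, not MADCAP's actual I/O layer.

    #include <stdio.h>
    #include <stdlib.h>

    /* Each process writes its local slab of a distributed matrix to disk
     * and reads it back in a later phase.  The path below is a
     * hypothetical per-rank file on node-local scratch. */
    static void write_slab(int rank, const double *slab, size_t nelems)
    {
        char path[256];
        snprintf(path, sizeof path, "/scratch/local/madcap_slab.%d", rank);
        FILE *f = fopen(path, "wb");
        if (!f) { perror("fopen"); exit(1); }
        fwrite(slab, sizeof(double), nelems, f);
        fclose(f);
    }

    static void read_slab(int rank, double *slab, size_t nelems)
    {
        char path[256];
        snprintf(path, sizeof path, "/scratch/local/madcap_slab.%d", rank);
        FILE *f = fopen(path, "rb");
        if (!f) { perror("fopen"); exit(1); }
        if (fread(slab, sizeof(double), nelems, f) != nelems) {
            fprintf(stderr, "short read\n"); exit(1);
        }
        fclose(f);
    }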

MADCAP: MADbench

- MADbench is a "lightweight version of MADCAP"
  - Retains the operational complexity of full MADCAP
  - Global file system requirement relaxed (to run on ESS)
- MADbench is a proxy for full MADCAP
  - The authors hope to reduce I/O costs in MADbench and then apply their changes to the full MADCAP

UPC HPL: Multi-Threading

- High Performance Linpack is also dominated by BLAS3
  - In spite of its Top500 numbers, it is surprisingly hard to tune
  - Performance is very sensitive to block size, total problem size, etc.
- Recently written in UPC by Parry Husbands at LBNL
  - UPC is a parallel extension of C with several compilers
  - The portable Berkeley compiler uses lightweight communication
- UPC HPL is written in an event-driven style
  - User-level threads for tasks: factorization, matrix multiply, pivoting, …
  - Has been run with Pth, POSIX threads, and hand-rolled thread packages
- Performs well, e.g., on the X1 MSP at 64 processors:
  - MPI HPL: … Gflop/s (n=160,000)
  - UPC HPL: … Gflop/s (n=128,000)
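A minimal sketch of what this event-driven, task-based structure can look like in plain C is below; the task names, fixed-size queue, and sequential event loop are illustrative assumptions, whereas the real UPC HPL schedules such tasks onto user-level threads within each UPC process.

    #include <stdio.h>

    /* Work items (panel factorization, pivoting, trailing-matrix update)
     * are queued and executed as their inputs become available. */
    typedef void (*task_fn)(int step);
    typedef struct { task_fn fn; int step; } task;

    #define MAX_TASKS 64
    static task queue[MAX_TASKS];
    static int head = 0, tail = 0;

    static void enqueue(task_fn fn, int step)
    {
        queue[tail].fn = fn;
        queue[tail].step = step;
        tail = (tail + 1) % MAX_TASKS;
    }

    static void factor_panel(int step)    { printf("factor panel %d\n", step); }
    static void apply_pivots(int step)    { printf("apply pivots %d\n", step); }
    static void update_trailing(int step) { printf("update trailing matrix %d\n", step); }

    int main(void)
    {
        /* Seed a few factorization steps; in the real code, completion of
         * one task enqueues its dependents dynamically. */
        for (int s = 0; s < 3; s++) {
            enqueue(factor_panel, s);
            enqueue(apply_pivots, s);
            enqueue(update_trailing, s);
        }
        while (head != tail) {            /* event loop: run ready tasks */
            task t = queue[head];
            head = (head + 1) % MAX_TASKS;
            t.fn(t.step);
        }
        return 0;
    }

The point of this style is that independent tasks for different panels can interleave, so the scheduler, rather than a fixed loop order, determines how communication and computation overlap; this is also why performance is sensitive to the thread scheduler, as noted in the General Thoughts slide.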

Adaptive Mesh Refinement Comm Time

- AMR is notoriously hard to scale due to communication cost
- Mike Welcome at LBNL plans to build a one-sided comm version
  - Will use overlapped/asynchronous communication on GASNet
  - May use dynamic load balancing (remote task scheduling)
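As a rough illustration of the overlapped, one-sided style being proposed, the sketch below uses the GASNet-1 extended API's non-blocking put; the function, buffer names, and the stand-in "interior relaxation" loop are assumptions, and a real AMR code would register segments at startup and exchange many ghost regions per step.

    #include <gasnet.h>

    /* Push one boundary region to a neighbor with a non-blocking put,
     * overlap it with local work on the patch interior, then synchronize
     * before the ghost values are needed. */
    void exchange_boundary(gasnet_node_t neighbor,
                           void *remote_dst, double *local_src, size_t nbytes,
                           double *patch, size_t npatch)
    {
        gasnet_handle_t h =
            gasnet_put_nb(neighbor, remote_dst, local_src, nbytes);

        for (size_t i = 0; i < npatch; i++)
            patch[i] *= 0.5;              /* stand-in for interior relaxation */

        gasnet_wait_syncnb(h);            /* complete before touching ghosts */
    }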

General Thoughts

- This is just an initial set of suggestions
  - Choice influenced by existing collaborations at LBNL
  - Please add your own ideas!
- Tradeoffs on the particular choices
  - Detailed performance info would be very useful for all of these
  - The MADCAP application is the most complete/real application
  - Data for performance comparisons exist for MADCAP and HPL
  - UPC HPL is complete, but performance sensitivity to the thread scheduler needs evaluation: it seems to be an issue so far
  - Too much BLAS3 with MADCAP and HPL
  - AMR is more challenging from a performance standpoint, but it doesn't yet exist