
A Framework for Particle Advection for Very Large Data
Hank Childs, LBNL/UC Davis; David Pugmire, ORNL; Christoph Garth, Kaiserslautern; David Camp, LBNL/UC Davis; Sean Ahern, ORNL; Gunther Weber, LBNL; Allen Sanderson, Univ. of Utah

Particle advection basics
Advecting particles creates integral curves.
- Streamlines: display the particle path through an instantaneous snapshot of the velocity field.
- Pathlines: display the particle path as the velocity field evolves while the particle moves.
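To make the distinction concrete, here is a minimal sketch of advecting a single particle with forward-Euler steps (illustrative C++ only, not VisIt code; the field callback and time handling are assumptions). A streamline samples the field at a frozen time; a pathline lets time advance with the particle:

    #include <array>
    #include <functional>
    #include <vector>

    using Vec3 = std::array<double, 3>;

    // Velocity field sampled at a position and a time. For a streamline the
    // time argument is held fixed; for a pathline it advances with the particle.
    using VelocityField = std::function<Vec3(const Vec3&, double)>;

    // Advect one particle with forward-Euler steps, recording the curve.
    // (A production integrator would use RK4 or Dormand-Prince; see later.)
    std::vector<Vec3> Advect(const Vec3& seed, const VelocityField& v,
                             double t0, double dt, int maxSteps, bool pathline)
    {
        std::vector<Vec3> curve{seed};
        Vec3 p = seed;
        double t = t0;
        for (int i = 0; i < maxSteps; ++i)
        {
            const Vec3 vel = v(p, pathline ? t : t0);  // streamline: frozen field
            for (int c = 0; c < 3; ++c)
                p[c] += dt * vel[c];
            t += dt;
            curve.push_back(p);
        }
        return curve;
    }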

Particle advection is the duct tape of the visualization world: advecting particles is essential to understanding flow! (Many thanks to Christoph Garth.)

Outline
- Efficient advection of particles
- A general system for particle-advection-based analysis

Advecting particles

Particle Advection Performance
- N particles (P1, P2, ..., Pn) and M MPI tasks (T1, ..., Tm).
- Each particle takes a variable number of steps, X1, X2, ..., Xn; the total number of steps is Σ Xi, and no algorithm can do less work than that.
- Goal: distribute the Σ Xi steps over the M MPI tasks so that the problem finishes in minimal time.
- This sounds like a bin-packing problem, except that particles do not need to stay tied to processors.
- But big data significantly complicates this picture: the data a particle needs may not be readily available, introducing busy-wait.
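To state the goal precisely (a restatement of the slide, not text from the original): writing A_j for the set of particles whose steps task T_j executes (migration relaxes this to a partition of individual steps), we want to

    \min_{A_1,\dots,A_M} \; \max_{1 \le j \le M} \sum_{i \in A_j} X_i
    \qquad \text{subject to} \qquad \bigcup_{j=1}^{M} A_j = \{1,\dots,N\}.

No schedule can finish in fewer than max( max_i X_i, ceil(Σ Xi / M) ) steps: a single particle's steps are sequential, and Σ Xi / M is the average load per task. Unlike classic bin packing, the X_i are unknown until runtime, which is why dynamic strategies are needed.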

Advecting particles: a large data set is decomposed into blocks on the filesystem. What is the right strategy for getting particle and data together?

Strategy: load the blocks necessary for advection. When a particle needs a block, go to the filesystem and read that block.

Strategy: load blocks necessary for advection. This strategy has multiple benefits:
1) Indifferent to data size: a serial program can process data of any size.
2) Trivial parallelization (partition particles over processors).
BUT: redundant I/O (both across MPI tasks and within a task) is a significant problem.

"Parallelize over Particles": particles are partitioned over processors, and blocks of data are loaded as needed. Some additional complexities:
- The work for a given particle (i.e., Xi) is variable and not known a priori: how do we share load between processors dynamically?
- There are more blocks than can be stored in memory: what is the best caching/purging strategy? (One plausible strategy is sketched below.)
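One plausible answer to the caching question is a least-recently-used block cache (an illustration under assumptions: BlockId, DataBlock, and LoadBlockFromDisk are hypothetical stand-ins, and the paper does not prescribe this particular policy):

    #include <list>
    #include <memory>
    #include <unordered_map>
    #include <utility>

    struct DataBlock { /* one block's vector-field samples */ };
    using BlockId = int;

    std::unique_ptr<DataBlock> LoadBlockFromDisk(BlockId /*id*/)
    {
        // Hypothetical stand-in: a real version reads the block from the filesystem.
        return std::make_unique<DataBlock>();
    }

    // Keep at most 'capacity' blocks in memory; evict the least recently used.
    class BlockCache
    {
      public:
        explicit BlockCache(std::size_t capacity) : capacity_(capacity) {}

        DataBlock* Get(BlockId id)
        {
            auto it = index_.find(id);
            if (it != index_.end())
            {
                lru_.splice(lru_.begin(), lru_, it->second);  // mark most recent
                return it->second->second.get();
            }
            if (lru_.size() == capacity_)  // full: purge the coldest block
            {
                index_.erase(lru_.back().first);
                lru_.pop_back();
            }
            lru_.emplace_front(id, LoadBlockFromDisk(id));  // redundant I/O happens here
            index_[id] = lru_.begin();
            return lru_.front().second.get();
        }

      private:
        std::size_t capacity_;
        std::list<std::pair<BlockId, std::unique_ptr<DataBlock>>> lru_;
        std::unordered_map<BlockId, decltype(lru_)::iterator> index_;
    };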

Strategy: parallelize over blocks and dynamically assign particles. This strategy has multiple benefits:
1) Ideal for in situ processing.
2) Only load data once.
BUT: busy-wait is a significant problem.

Both parallelization schemes have serious flaws. The two extremes, with hybrid algorithms in between:

    Parallelizing over | I/O  | Efficiency
    -------------------|------|-----------
    Data               | Good | Bad
    Particles          | Bad  | Good

The master-slave algorithm is an example of a hybrid technique.
- The algorithm adapts during runtime to avoid the pitfalls of parallelize-over-data and parallelize-over-particles.
- This is a nice property for production visualization tools.
- Implemented inside the VisIt visualization and analysis package.
D. Pugmire, H. Childs, C. Garth, S. Ahern, G. Weber, "Scalable Computation of Streamlines on Very Large Datasets," SC09, Portland, OR, November 2009.

Master-Slave Hybrid Algorithm
- Divide processors into groups of N; uniformly distribute seed points to each group.
- Master: monitors workload and makes decisions to optimize resource utilization.
- Slaves: respond to commands from the master and report status when work is complete.
[Diagram: processors P0-P15 split into four groups, each with one master and three slaves.]
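The grouping itself is straightforward; a minimal sketch (an assumed encoding using MPI communicators, not necessarily how VisIt forms its groups):

    #include <mpi.h>

    // Divide MPI ranks into groups of N, each with one master and N-1 slaves.
    void FormGroups(int N, MPI_Comm* group, bool* isMaster)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // Ranks 0..N-1 form group 0, ranks N..2N-1 form group 1, and so on.
        MPI_Comm_split(MPI_COMM_WORLD, /*color=*/rank / N, /*key=*/rank, group);
        int groupRank;
        MPI_Comm_rank(*group, &groupRank);
        *isMaster = (groupRank == 0);  // lowest rank in each group is the master
    }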

Master Process Pseudocode:

    Master()
    {
        while ( ! done )
        {
            if ( NewStatusFromAnySlave() )
            {
                commands = DetermineMostEfficientCommand()
                for cmd in commands:
                    SendCommandToSlaves( cmd )
            }
        }
    }

What are the possible commands?

Commands that can be issued by the master (OOB = out of bounds):
1. Assign / Loaded Block: the slave is given a streamline that is contained in a block the slave has already loaded.
2. Assign / Unloaded Block: the slave is given a streamline and loads the needed block itself.
3. Handle OOB / Load: the slave is instructed to load a block; the out-of-bounds streamline can then be computed in that block.
4. Handle OOB / Send: the slave is instructed to send a streamline to another slave (slave J) that has the block loaded.

Master Process Pseudocode (as above). * See the SC09 paper for the details of DetermineMostEfficientCommand().
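Fleshing that pseudocode out, here is a minimal MPI-flavored sketch of the master's message loop (the command set follows the slides; the message encoding and the trivial placeholder policy are assumptions, since the real decision policy is the paper's contribution):

    #include <mpi.h>
    #include <utility>
    #include <vector>

    // Command types from the slides (OOB = out of bounds).
    enum Command { AssignLoadedBlock, AssignUnloadedBlock, HandleOOBLoad, HandleOOBSend };

    // Status report from a slave: [active particle count, id of loaded block].
    struct SlaveStatus { int rank; int activeParticles; int loadedBlock; };

    // Placeholder policy: deciding when to load vs. send, weighing I/O cost
    // against communication cost, is the heart of the real algorithm.
    std::vector<std::pair<int, Command>> DetermineMostEfficientCommands(const SlaveStatus& s)
    {
        return { { s.rank, AssignLoadedBlock } };
    }

    void SendCommandToSlave(int rank, Command cmd)
    {
        int msg = static_cast<int>(cmd);
        MPI_Send(&msg, 1, MPI_INT, rank, /*tag=*/1, MPI_COMM_WORLD);
    }

    void Master(int numSlaves)
    {
        int finished = 0;
        while (finished < numSlaves)
        {
            int report[2];  // wait for a status message from any slave
            MPI_Status st;
            MPI_Recv(report, 2, MPI_INT, MPI_ANY_SOURCE, /*tag=*/0,
                     MPI_COMM_WORLD, &st);
            SlaveStatus status{st.MPI_SOURCE, report[0], report[1]};
            if (status.activeParticles == 0)
                ++finished;  // this slave has no work left
            else
                for (const auto& rc : DetermineMostEfficientCommands(status))
                    SendCommandToSlave(rc.first, rc.second);
        }
    }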

Master-slave in action: a notional streamline example with slaves P0-P4.

    Iteration | Action
    ----------|-------
    0         | P0 reads B0; P3 reads B1
    1         | P1 passes points to P0; P4 passes points to P3; P2 reads B0

Open questions: when to pass and when to read? How to coordinate communication, and report status, efficiently?

Algorithm Test Cases
- Core-collapse supernova simulation
- Magnetic confinement fusion simulation
- Hydraulic flow simulation

Workload distribution in parallelize-over-data: starvation.

Workload distribution in parallelize-over-particles: too much I/O.

Workload distribution in the hybrid algorithm: just right.

[Figure: workload distribution in the supernova simulation, colored by the processor doing the integration, for parallelization over data, over particles, and hybrid.]

[Figure: astrophysics test case, total time in seconds to compute 20,000 streamlines vs. number of processors, under uniform and non-uniform seeding, for parallelize-over-data, parallelize-over-particles, and hybrid.]

[Figure: astrophysics test case, number of blocks loaded vs. number of processors, under uniform and non-uniform seeding, for parallelize-over-data, parallelize-over-particles, and hybrid.]

Summary: Master-Slave Algorithm
- The first-ever attempt at a hybrid parallelization algorithm for particle advection.
- The algorithm adapts during runtime to avoid the pitfalls of parallelize-over-data and parallelize-over-particles.
- This is a nice property for production visualization tools.
- Implemented inside the VisIt visualization and analysis package.

Outline
- Efficient advection of particles
- A general system for particle-advection-based analysis

Goal
- Efficient code for a variety of particle-advection-based techniques.
- Cognizant of use cases with >>10K particles: the handling of every particle and every evaluation must be efficient.
- Support for diverse flow techniques: flexibility/extensibility is key.

Motivating examples of PICS (parallel integral curve systems)
- FTLE (finite-time Lyapunov exponent)
- Stream surfaces
- Streamlines
- Poincaré analysis
- Statistics-based analysis
- and more

Design
- PICS filter: parallel integral curve system.
- Execution (sketched in code below):
  1. Instantiate particles at seed locations.
  2. Step particles to form integral curves; analysis is performed at each step, and termination criteria are evaluated for each step.
  3. When all integral curves have completed, create the final output.
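In code form, the control flow the slide describes looks roughly like this (a sketch with simplified names, not VisIt source):

    #include <vector>

    struct IntegralCurve
    {
        bool terminated = false;
        virtual void Step() = 0;              // advance one integration step
        virtual void AnalyzeStep() = 0;       // per-step analysis hook
        virtual bool CheckTermination() = 0;  // evaluated after every step
        virtual ~IntegralCurve() = default;
    };

    // Skeleton of the PICS execution loop: curves were instantiated at the
    // seed locations; step them until every curve has terminated.
    void Execute(std::vector<IntegralCurve*>& curves)
    {
        bool allDone = false;
        while (!allDone)
        {
            allDone = true;
            for (IntegralCurve* ic : curves)
            {
                if (ic->terminated)
                    continue;
                ic->Step();
                ic->AnalyzeStep();
                ic->terminated = ic->CheckTermination();
                allDone = allDone && ic->terminated;
            }
        }
        // When all integral curves have completed, the filter creates final output.
    }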

Design
- Five major types of extensibility:
  1. Initial particle locations?
  2. How do you evaluate the velocity field?
  3. How do you advect particles?
  4. How do you parallelize?
  5. How do you analyze the particle paths?

Inheritance hierarchy: avtPICSFilter is subclassed by the streamline filter and by your derived type of PICS filter; avtIntegralCurve is subclassed by avtStreamlineIC and by your derived type of integral curve. We disliked this "matching inheritance" scheme (each filter is paired with its own integral curve type), but it achieved all of our design goals cleanly.

#1: Initial particle locations
avtPICSFilter::GetInitialLocations() = 0;

#2: Evaluating the velocity field (IVP = initial value problem)
avtIVPField is subclassed by avtIVPVTKField, avtIVPVTKTimeVaryingField, avtIVPM3DC1Field, and avtIVPHigherOrderField (the last doesn't exist yet).
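For intuition, evaluating the field at an arbitrary point typically means interpolating mesh samples. A minimal sketch for a uniform grid (an illustration under assumptions: spacing h, origin at zero, x varying fastest, bounds checks elided; the actual avtIVPField subclasses evaluate against their own mesh types):

    #include <array>
    #include <vector>

    using Vec3 = std::array<double, 3>;

    struct UniformGridField
    {
        int nx, ny, nz;       // sample counts per axis
        double h;             // grid spacing
        std::vector<Vec3> v;  // nx*ny*nz velocity samples

        Vec3 At(int i, int j, int k) const { return v[(k * ny + j) * nx + i]; }

        // Trilinear interpolation of the 8 samples surrounding point p.
        Vec3 Evaluate(const Vec3& p) const
        {
            const int i = int(p[0] / h), j = int(p[1] / h), k = int(p[2] / h);
            const double fx = p[0] / h - i, fy = p[1] / h - j, fz = p[2] / h - k;
            Vec3 out{};
            for (int dk = 0; dk <= 1; ++dk)
                for (int dj = 0; dj <= 1; ++dj)
                    for (int di = 0; di <= 1; ++di)
                    {
                        const double w = (di ? fx : 1 - fx) * (dj ? fy : 1 - fy)
                                         * (dk ? fz : 1 - fz);
                        const Vec3 s = At(i + di, j + dj, k + dk);
                        for (int c = 0; c < 3; ++c)
                            out[c] += w * s[c];
                    }
            return out;
        }
    };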

#3: How do you advect particles? (IVP = initial value problem)
avtIVPSolver is subclassed by avtIVPDopri5, avtIVPEuler, avtIVPLeapfrog, and avtIVPM3DC1Integrator.
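As a concrete example of what one solver step looks like, here is the classical fourth-order Runge-Kutta update (a self-contained sketch; avtIVPDopri5 actually implements the more elaborate adaptive Dormand-Prince 5(4) scheme):

    #include <array>
    #include <functional>

    using Vec3 = std::array<double, 3>;
    using Field = std::function<Vec3(const Vec3&, double)>;

    // One classical RK4 step from (p, t) to time t + dt.
    Vec3 RK4Step(const Vec3& p, double t, double dt, const Field& v)
    {
        auto axpy = [](const Vec3& a, double s, const Vec3& b) {
            return Vec3{a[0] + s * b[0], a[1] + s * b[1], a[2] + s * b[2]};
        };
        const Vec3 k1 = v(p, t);
        const Vec3 k2 = v(axpy(p, dt / 2, k1), t + dt / 2);
        const Vec3 k3 = v(axpy(p, dt / 2, k2), t + dt / 2);
        const Vec3 k4 = v(axpy(p, dt, k3), t + dt);
        Vec3 out;
        for (int c = 0; c < 3; ++c)  // weighted blend of the four slopes
            out[c] = p[c] + dt / 6 * (k1[c] + 2 * k2[c] + 2 * k3[c] + k4[c]);
        return out;
    }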

#4: How to parallelize?
avtICAlgorithm is subclassed by avtParDomICAlgorithm (parallel over data), avtSerialICAlgorithm (parallel over seeds), and avtMasterSlaveICAlgorithm.

#5: How do you analyze the particle paths?
- avtIntegralCurve::AnalyzeStep() = 0; every AnalyzeStep implementation also evaluates the termination criteria.
- avtPICSFilter::CreateIntegralCurveOutput( std::vector<avtIntegralCurve*> & ) = 0;
- Examples:
  - Streamline: store the location and scalars for the current step in data members.
  - Poincaré: store the location for the current step in data members.
  - FTLE: only store the location of the final step; a no-op for preceding steps.
- NOTE: these derived types create very different types of outputs.
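A derived curve type, then, only has to fill in AnalyzeStep(). A minimal sketch contrasting the streamline and FTLE behaviors (class shapes are simplified assumptions, not the actual VisIt headers):

    #include <array>
    #include <vector>

    using Vec3 = std::array<double, 3>;

    // Simplified stand-in for avtIntegralCurve's per-step hook.
    struct IntegralCurveBase
    {
        Vec3 currentLocation;
        virtual void AnalyzeStep() = 0;  // overrides also evaluate termination
        virtual ~IntegralCurveBase() = default;
    };

    // Streamline: remember every point so the full path can be output.
    struct StreamlineIC : IntegralCurveBase
    {
        std::vector<Vec3> path;
        void AnalyzeStep() override { path.push_back(currentLocation); }
    };

    // FTLE: only the final location matters, so each step just overwrites it.
    struct FTLEIC : IntegralCurveBase
    {
        Vec3 finalLocation;
        void AnalyzeStep() override { finalLocation = currentLocation; }
    };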

Putting it all together: the PICS filter holds an avtICAlgorithm, an avtIVPSolver, an avtIVPField, and a vector of avtIntegralCurve objects. With some derived types of avtICAlgorithm, integral curves are sent to other processors. The pure virtual hooks the deriver fills in are the filter's ::CreateInitialLocations() = 0; and each curve's ::AnalyzeStep() = 0;

VisIt is an open source, richly featured, turn-key application for large data.
- Used by: visualization experts, simulation code developers, and simulation code consumers.
- Popular: R&D 100 award in 2005; used on many of the Top500; >100K downloads.
[Image: 217-pin reactor cooling simulation, run on ¼ of Argonne's BG/P, 1 billion grid points per time slice. Image credit: Paul Fischer, ANL.]

Final thoughts…
- Summary: particle advection is important for understanding flow, and efficiently parallelizing this computation is difficult. We have developed a freely available system for doing this analysis on large data.
- Documentation: (PICS) (VisIt)
- Future work: UI extensions, including Python; additional analysis techniques (FTLE and more).

Acknowledgements
- Funding: this work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, through the Scientific Discovery through Advanced Computing (SciDAC) program's Visualization and Analytics Center for Enabling Technologies (VACET).
- Program Manager: Lucy Nowell
- Master-Slave Algorithm: Dave Pugmire (ORNL), Hank Childs (LBNL/UCD), Christoph Garth (Kaiserslautern), Sean Ahern (ORNL), and Gunther Weber (LBNL)
- PICS framework: Hank Childs (LBNL/UCD), Dave Pugmire (ORNL), Christoph Garth (Kaiserslautern), David Camp (LBNL/UCD), and Allen Sanderson (Univ. of Utah)