In Situ Sampling of a Large-Scale Particle Simulation
Jon Woodring, Los Alamos National Laboratory
DOE CGF 2011


Exascale I/O Bandwidth Compared to Compute Rate Is Imbalanced
One of the largest challenges for analysis at exascale (and currently) is I/O bandwidth, both for disks and networks. A Roadrunner Universe MC³ run of 64 billion particles generates 2.3 TB per time slice. For x flops, a simulation generates roughly x/c bytes of data, where c is a constant, so compute grows faster than the bandwidth available to store its output.
Figure: Compute (green) is growing faster than disk bandwidth (red).
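As a quick back-of-the-envelope check (not from the talk), those two numbers imply a per-particle footprint of roughly 36 bytes:

```python
particles = 64e9            # 64 billion particles per time slice
bytes_per_slice = 2.3e12    # 2.3 TB per time slice

print(bytes_per_slice / particles)  # ~35.9 bytes per particle
# e.g., six single-precision floats (position + velocity, 24 bytes)
# plus an ID and tags would be in this ballpark -- an assumption,
# not the simulation's documented particle layout.
```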

Implicit Visual Downsampling through Massively Parallel Visualization
There is a lack of display bandwidth compared to data sizes. The data on the right aren't even "large": at 16 million points, they are three orders of magnitude smaller than the 64-billion-particle run. We need to use level-of-detail (multi-scale) rendering.
Figure: The structure in a cosmology data set can't be seen because there are more data points than pixels.

In-Situ Visualization and Analysis
In-situ visualization and analysis performs some analysis operations while a simulation is running. The trick is figuring out what data, images, or features to save for post-analysis. For the run shown, only the "halos" were saved.
Figure: Visualization of dark matter clusters from a cosmology simulation. Clusters can be calculated as the simulation is running, greatly reducing the necessary storage space, instead of writing the full-resolution data to storage and calculating the clusters in post-analysis.

Quantified Data Scaling at Exascale
To adapt to bandwidth limitations, we focus on managing data for analysis:
▫Use simulation-time analysis to explicitly reduce data at compute run-time, such as using statistical sampling to store data samples
▫Use quantifiable data reduction, such as recording the original statistical variance when performing data reduction
This work is in conjunction with the LANL statistics group (CCS-6).

Benefits and Drawbacks of In-Situ Visualization and Analysis
Benefits
▫Data can be scaled down to a compact form, such as images, features, clusters, and samples
▫By scaling down the data, the simulation I/O time can be budgeted in a fine-grained fashion
Drawbacks
▫The types of analysis to be generated at run-time have to be decided in advance: which plots to calculate, which features to extract, and which data to save for post-analysis
▫Some simulation time has to be spent on analysis

Try to Defer the Decision: Use Statistical Sampling to Store "Raw" Data
We use statistical random sampling as a raw-data scaling operation for interactive post-analysis and visualization. Random sampling is applied, as the simulation is running, to store a smaller sample. The implementation is slightly more involved than simple random sampling (slides to come).
Figures: Raw cosmology simulation data; sampled cosmology data.
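For concreteness, here is a minimal sketch of the simple-random-sampling baseline that the following slides build on. All names (`positions`, `velocities`, `fraction`) are illustrative, not the simulation's API:

```python
import numpy as np

def simple_random_sample(positions, velocities, fraction, rng=None):
    """Uniform random sample of particles at output time (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = positions.shape[0]
    k = max(1, int(n * fraction))
    idx = rng.choice(n, size=k, replace=False)  # uniform, without replacement
    return positions[idx], velocities[idx]

# e.g., keep a 0.19% sample, matching the sample size used later in the talk:
# pos_s, vel_s = simple_random_sample(pos, vel, 0.0019)
```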

Benefits of Random Sampling as an In-Situ Data Scaling Algorithm
Most analysis operations can be applied to the sample data in post-analysis, instead of deciding the analyses in advance
Storage space and bandwidth can be managed at a fine grain by deciding how much to scale I/O utilization
Provides a data representation that is unbiased for statistical estimators, e.g., the mean
Increases the time fidelity in post-analysis by increasing the number of available time slices
Can quantify the difference between the sample data and the original full-resolution data

Stratified Random Sampling via kd-tree and Parallel Decomposition
Using the parallel processor decomposition and kd-tree sorting, random samples are evenly spread across space, so statistical estimators have lower variance. This is "stratified random sampling" in the statistics literature (though the statisticians had to verify that).
Figures: Sorting data into sampling strata; acquiring a stratified random sample from the decomposition.
Woodring et al., "In-situ Sampling of a Large-Scale Particle Simulation for Interactive Visualization and Analysis", EuroVis 2011
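A minimal sketch of the idea, under assumptions: median-split particles into spatial strata with a kd-tree, then draw the same fraction from each stratum. Function and parameter names are illustrative, and the paper's actual tree construction may differ in detail:

```python
import numpy as np

def kdtree_strata(points, leaf_size):
    """Recursively median-split on the widest axis until each stratum
    holds at most `leaf_size` points; returns a list of index arrays."""
    def split(idx):
        if idx.size <= leaf_size:
            return [idx]
        axis = np.ptp(points[idx], axis=0).argmax()    # widest spatial extent
        order = idx[np.argsort(points[idx, axis])]     # sort stratum on that axis
        mid = order.size // 2
        return split(order[:mid]) + split(order[mid:])
    return split(np.arange(points.shape[0]))

def stratified_sample(points, fraction, leaf_size=1024, rng=None):
    """Take the same fraction from every stratum, so the sample is
    spread evenly across space (lower-variance estimators)."""
    rng = np.random.default_rng() if rng is None else rng
    picks = []
    for stratum in kdtree_strata(points, leaf_size):
        k = max(1, int(round(stratum.size * fraction)))
        picks.append(rng.choice(stratum, size=k, replace=False))
    return np.concatenate(picks)
```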

Stratified Sampling Maps Directly to Interactive Visualization by Level-of-Detail
The sampling strata decomposition naturally maps into a tree-based storage hierarchy that can be used to read the data. This removes post-processing and provides level-of-detail (multi-resolution) scaling for interactive visualization.
Movie: Level-of-detail interaction matching data density to pixels
Movie: Streaming interaction showing block reads from disk
Woodring et al., "In-situ Sampling of a Large-Scale Particle Simulation for Interactive Visualization and Analysis", EuroVis 2011
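One way to realize such a hierarchy, sketched here under assumptions rather than taken from the paper: shuffle within each stratum and interleave strata round-robin, so that any prefix of the stored ordering is itself a spatially even, coarser random sample:

```python
import numpy as np

def write_lod_order(sample_idx_per_stratum, rng=None):
    """Order sampled particles so any prefix is a valid coarser sample.

    Shuffle within each stratum, then emit round-robin across strata:
    reading the first k records gives roughly k/n of every region of
    space, so a viewer can stream the file front-to-back and refine.
    """
    rng = np.random.default_rng() if rng is None else rng
    shuffled = [rng.permutation(s) for s in sample_idx_per_stratum]
    depth = max(len(s) for s in shuffled)
    order = []
    for level in range(depth):          # level 0, 1, 2, ... = finer detail
        for s in shuffled:
            if level < len(s):
                order.append(s[level])
    return np.array(order)
```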

Quantifying the Local Differences Between Samples and Original Full-Resolution Data
Since the sampling algorithm is done in situ, we are able to measure the local differences between the sample data and the original data, and to augment the sample data with statistics of the original data. The full-resolution statistics provide an error bound for the sample data and information about the "lost" data.
Figure: Particles are colored by the local variance of the original data; a cluster is identified in a low-density region.
Woodring et al., "In-situ Sampling of a Large-Scale Particle Simulation for Interactive Visualization and Analysis", EuroVis 2011
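A sketch of the augmentation, with hypothetical names (`values` could be a velocity component, `strata` the kd-tree index lists): record full-resolution summary statistics per stratum before discarding the unsampled particles:

```python
import numpy as np

def stratum_statistics(values, strata):
    """Per-stratum mean/variance/count of the full-resolution data.

    Stored alongside the sample, the variance bounds the error of the
    retained points and flags regions where discarded data differed
    from what was kept (e.g., a cluster in a low-density stratum).
    """
    stats = []
    for idx in strata:
        v = values[idx]
        stats.append({"count": int(v.size),
                      "mean": float(v.mean()),
                      "variance": float(v.var(ddof=1)) if v.size > 1 else 0.0})
    return stats
```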

Small Samples Still Reflect the Original Data: A 0.19% Sample and Full-Resolution Data
Figures: Histograms of velocity components in cosmological particle data, and average velocity component values over space; in each, red is sample data and black is original simulation data. The sample size is 32K (0.19%) and the original size is 16M. Both curves exist in all graphs, but the curve occlusion is reversed in the top plots compared to the bottom plots.
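A comparison plot of this kind takes only a few lines; the sketch below (illustrative code, not the authors') normalizes both histograms as densities so a 32K sample is directly comparable to 16M points:

```python
import matplotlib.pyplot as plt

def compare_histograms(full, sample, bins=64):
    """Overlay sample vs. full-resolution histograms of one variable,
    normalized as densities so differing counts are comparable."""
    plt.hist(full, bins=bins, density=True, histtype="step",
             color="black", label="original")
    plt.hist(sample, bins=bins, density=True, histtype="step",
             color="red", label="sample")
    plt.legend()
    plt.show()
```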

Complex Analysis of Sample Data Compared to Original Simulation Data
We are also able to perform many other analyses on the "raw" sample data: here, we use a domain-specific "halo" (cluster) analysis on the sample data. The types of analysis that can be used are completely domain- and analysis-dependent, and have to be tested on a case-by-case basis.
Figure: Histogram of cluster analysis in cosmology data for different sample sizes. The original size is 16M (black), with three samples: red is a 0.19% sample, green is a 1.6% sample, and blue is a 12.5% sample.
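For reference, a generic friends-of-friends clustering sketch of the kind commonly used for halo finding; the talk does not specify its halo finder, so this is an assumption-laden illustration, not the authors' method:

```python
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(points, linking_length):
    """Label particles by cluster: any pair closer than the linking
    length is 'linked' (union-find over kd-tree neighbor pairs)."""
    parent = np.arange(len(points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    tree = cKDTree(points)
    for i, j in tree.query_pairs(linking_length):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # merge the two clusters
    return np.array([find(i) for i in range(len(points))])
```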

Measuring the Performance of Statistical Sampling with a Simulation
We tested the system performance characteristics of our sampling method with an 8-billion-particle Roadrunner Universe MC³ simulation.
Table: Different sample size configurations in our testing per time slice, where N is the original data size in particles ("GB" is short for gigabytes, "res" for resolution, and "LOD" for level-of-detail).

Workflow Savings by Writing the Data Directly into an Interactive Format
Assuming we use the same simulation I/O time, we always save 6 minutes per time step, from simulation to interactive post-analysis, by directly writing data into a level-of-detail format:
▫We remove one read and one write of the data in post-processing, and the data are immediately ready for visualization and analysis from the simulation
Table: "Write full" is the time to write full-resolution data from the simulation, while "read full" is the time it takes to read full-resolution data (Cerrillos connected to a Panasas file system). "Lag save" is the time saved between simulation and post-analysis.

Simulation I/O Time Savings per Time Slice for Different Sample Sizes
Writing less data through sampling significantly reduces the amount of time spent in I/O per time slice. The green highlighted bars are instances of writing half of the original data size or less.
Figures: Time taken in seconds to write a time slice for different sample configurations (sizes); time savings in seconds to write a time slice for different sample configurations (sizes).

Trading Full-Resolution Data for Sample Data Allows for Fine-Grained Simulation I/O
Our in-situ sampling method allows the simulation to adaptively control its I/O expenditure:
▫Increase the number of time slices, retain high-spatial-resolution data, and manipulate I/O utilization at a fine grain
Figure: The ratio of time slices that can be stored in the same time that it takes to write non-LOD full-resolution data. For example, the I/O budget for 100 full-resolution time slices can instead be spent on a mix of low-resolution sample slices and 20 full-resolution slices.
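The trade-off is simple budget arithmetic; the sketch below uses made-up per-slice costs (all names and numbers here are illustrative, not measurements from the talk):

```python
def sampled_slices_affordable(full_cost, sample_cost,
                              full_slices_kept, budget_slices):
    """How many sampled time slices fit in the I/O budget of writing
    `budget_slices` full-resolution slices, after reserving room for
    `full_slices_kept` slices at full resolution? Costs are per-slice
    write times (or bytes)."""
    budget = budget_slices * full_cost
    remaining = budget - full_slices_kept * full_cost
    return int(remaining // sample_cost)

# e.g., if a sampled slice costs 1/8 of a full slice, a 100-slice budget
# minus 20 full-resolution slices leaves room for 640 sampled slices:
# sampled_slices_affordable(1.0, 0.125, 20, 100) -> 640
```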

Future Research for Data Sampling Methods in Exascale Scientific Computing
Exploring other methods for sampling:
▫Outlier sampling
▫Sampling in high-dimensional space
▫Temporally coherent random sampling
▫Sampling on structured and unstructured grids
▫Feature- and data-mining-driven sampling
Evaluating reconstruction methods for sampled data
Tracking and calculating error in analysis tools
Additional scaling knobs: precision, variables…