Automatic Data Selection for In Situ Analysis

Science Problem: Cognitive capacity (human/scientist understanding), storage, and I/O have not kept up with our capacity to generate massive amounts of physics-based simulation data. Data must be reduced, but which data do we keep or look at?

Technical Solution: Apply automatic data selection operations, using various metrics and criteria, to emulate the exploration during physics simulations (in situ) and make the use of data bandwidth more “information dense.” Physics simulations are reduced to a representative set based on the selected metric.

Science Impact: Scientist and analyst time is used more efficiently, and bandwidth (cognitive, storage, and I/O) is “richer,” as more scientifically relevant simulation information is packed into it.

Woodring, Myers, Nouanesengsy, Wendelberger, Patchett, Fasel, Ahrens

[Figure; surviving labels: discovery events, flow control, (lightweight), (heavyweight), detectors, triggers, dynamic, computational, raw, data products, data]
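The idea of reducing a simulation to a representative set according to a chosen metric can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `select_timesteps`, the use of the mean as the metric, the threshold rule, and the toy data are all assumptions made for illustration.

```python
from statistics import fmean


def select_timesteps(steps, metric=fmean, threshold=0.5):
    """Return indices of timesteps worth keeping.

    A timestep is kept when its metric value differs from that of the
    last kept timestep by more than `threshold`. The first timestep is
    always kept. (Hypothetical rule, chosen only for illustration.)
    """
    kept = []
    last_value = None
    for index, field in enumerate(steps):
        value = metric(field)  # cheap summary of the current field
        if last_value is None or abs(value - last_value) > threshold:
            kept.append(index)
            last_value = value
    return kept


# Hypothetical toy "simulation": a field that drifts slowly, then jumps.
steps = [[0.0, 0.1], [0.1, 0.1], [0.2, 0.2], [1.5, 1.7], [1.6, 1.7], [3.0, 3.2]]
selected = select_timesteps(steps)
print(selected)
```

Only the timesteps where the summary changes noticeably are retained, so the stored output carries more information per byte than a fixed every-N-steps dump would. Any other metric (entropy, gradient magnitude, a domain-specific detector) could be passed in place of the mean.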

Combining Statistics, Sampling, and Indexing to Quantitatively Scale Massive Scientific Data

Science Problem: Climate and cosmological analysis and exploration via queries on massive data sets may still return a massive amount of data for the scientist to comb through and/or transfer.

Technical Solution: Combine bitmap indexing with stratified random sampling (stratified by bins, random within bins) and precalculated errors to quickly and quantitatively scale data.

Science Impact: Climate and cosmological scientists are able to interactively and automatically scale data queries down to smaller samples with automatic error estimates, accelerating their scientific analysis workflow.

Su, Agrawal, Woodring, Myers, Wendelberger, and Ahrens. “Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices.” To appear in HPDC 2013.
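A minimal sketch of sampling that is stratified by bins and random within bins, driven by a bitmap index, might look like the following. It is an illustration under assumptions, not the method from the HPDC paper: the helper names (`bin_of`, `build_bitmap_index`, `rows_of`, `stratified_sample`), the equal-width binning, and the use of Python integers as uncompressed bitmaps are all hypothetical, and the precalculated error estimates are omitted.

```python
import random


def bin_of(value, lo, hi, n_bins):
    """Equal-width bin id for a value assumed to lie in [lo, hi]."""
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


def build_bitmap_index(values, lo, hi, n_bins):
    """One bitmap per bin: bit `row` is set when values[row] is in the bin."""
    bitmaps = [0] * n_bins
    for row, value in enumerate(values):
        bitmaps[bin_of(value, lo, hi, n_bins)] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bits are set in a bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


def stratified_sample(bitmaps, fraction, rng):
    """Draw about `fraction` of the rows from every non-empty bin."""
    sample = []
    for bitmap in bitmaps:
        rows = rows_of(bitmap)
        if rows:
            k = min(len(rows), max(1, round(fraction * len(rows))))
            sample.extend(rng.sample(rows, k))
    return sorted(sample)


# Hypothetical data: 1000 values in [0, 100), indexed with 10 bins.
rng = random.Random(0)
values = [rng.uniform(0.0, 100.0) for _ in range(1000)]
bitmaps = build_bitmap_index(values, 0.0, 100.0, 10)
sample = stratified_sample(bitmaps, 0.1, rng)
print(len(sample), "of", len(values), "rows selected")
```

Because every non-empty bin contributes rows, the sample preserves the shape of the value distribution even at small sampling fractions, and the raw values are never scanned during sampling, only the index.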

Metrics and Workflow for Quantifying the Quality of Reduction Transformations on Large-Scale, Scientific Scalar Data

Science Problem: Storage and I/O have not kept up with the capacity of climate and cosmological simulations to generate data. Data must be reduced, but what has been lost in the process of reduction?

Technical Solution: After every reducing transformation (T), assess and record the residuals/errors via compact measures (M) that compare the “original” data (A) with data (B) reconstructed from the reduced data (a).

Science Impact: Measuring the data uncertainty via a quantitative quality provenance shows that climate and cosmological scientists are able to use reduced data for scientific analysis.

Woodring, Shafii, Biswas, Myers, Wendelberger, Hamann, and Shen. “Metrics and Workflow for Quantifying the Quality of Reduction Transformations on Large-Scale, Scientific Scalar Data.” Submitted to LDAV
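The T / a / B / M workflow can be sketched on a one-dimensional array. The specific choices below are assumptions for illustration only and are not taken from the paper: strided subsampling stands in for the reduction T, linear interpolation for the reconstruction, and RMSE plus maximum absolute error for the compact measures M. All function names are hypothetical.

```python
from math import sqrt


def reduce_by_striding(original, stride):
    """T: keep every `stride`-th value, and always keep the last one."""
    kept = list(range(0, len(original), stride))
    if kept[-1] != len(original) - 1:
        kept.append(len(original) - 1)
    return kept, [original[i] for i in kept]


def reconstruct(kept, reduced, length):
    """Build B: a full-length array interpolated linearly between kept points."""
    out = [0.0] * length
    out[kept[0]] = float(reduced[0])
    for i0, v0, i1, v1 in zip(kept, reduced, kept[1:], reduced[1:]):
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            out[i] = v0 + t * (v1 - v0)
    return out


def quality_measures(original, reconstructed):
    """M: compact summaries of the residuals A - B."""
    residuals = [a - b for a, b in zip(original, reconstructed)]
    rmse = sqrt(sum(r * r for r in residuals) / len(residuals))
    max_abs = max(abs(r) for r in residuals)
    return {"rmse": rmse, "max_abs_error": max_abs}


# Hypothetical "original" data A: a small quadratic ramp.
original = [float(i * i) for i in range(9)]
kept, reduced = reduce_by_striding(original, 4)      # a = T(A)
rebuilt = reconstruct(kept, reduced, len(original))  # B from a
measures = quality_measures(original, rebuilt)       # M(A, B)
print(measures)
```

Storing `measures` alongside the reduced data gives a small quality record for each transformation, so a later analysis can judge whether the reduced data are accurate enough for its purpose without access to the original.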

IMD Exploration of Exascale In Situ Visualization and Analysis Approaches

Principal Investigator: James Ahrens et al., LANL
Sept. 25, 2013

Novel Ideas

Perception, cognitive capacity, and computer bandwidth are limited, but the scale of data continues to increase. In order to save both compute cycles and analyst time, we explicitly select and present data to the scientist:
- Time, space, and variable selection algorithms
- Store and index selected data products
- Interactive presentation and query methods
- Illustratively and artistically highlight to perceptually drive scientists to key data

Expected Impact

- Save compute time, as only selected data are stored with limited I/O bandwidth and capacity
  - Not all data can be saved from an exascale simulation; therefore we must be prescriptive about what data are saved
  - We will provide data selection algorithms
- Save analysis time, as selected data are presented to the scientist
  - The resulting data will still be massive, more than any one scientist can look at
  - Interaction, query, and highlighting methods drive scientists to view key data

Milestones and Status

In our first year, we have developed several data selection algorithms, designed a prototype visualization and analysis system that utilizes selected data, and quantified the effects of selecting scientific data.

Milestone                             Expected   Actual
Data Selection Algorithms             03/13      03/13
Data Product Explorer                 09/13      09/13
Selection Quality Measurements        09/13      09/13
Advanced Selection Algorithms         03/14
Perceptually-Driven Presentation      03/14
Selection in Data Product Explorer    09/14
Domain-Driven Selection               03/15
Bandwidth Utilization Adjustment      09/15
Advanced Presentation Methods         09/15

[Figure: Exascale and “Big Data” sources (running simulations, running experiments, static repositories, etc.); Data Selection Algorithms (time, space, variable, product, etc.); Analysis with Perceptually-Driven Highlighting; data product labels: raw, image, geometry, …]