Accelerating Scientific Discovery by Tracking Data Provenance. Presenter: Dong Dai, DISCL Lab (Data-Intensive Scalable Computing Laboratory), Department of Computer Science, TTU. 04-29-2016.


1 / 23 Presenter: Dong Dai, DISCL Lab. TTU Data-Intensive Scalable Computing Laboratory Department of Computer Science Accelerating Scientific Discovery by Tracking Data Provenance A work done with Dr. Yong Chen (TTU), Dr. Robert Ross (ANL), Dr. Phil Carns (ANL), and Dr. John Jenkins (ANL)

2 / 23 Outline
Provenance: A Quick Introduction
Motivation: A Scientific Example
LPS: Lightweight Provenance Service
LPS: Performance Evaluation
Conclusion & Current Work

3 / 23 Provenance

4 / 23 Provenance: A Quick Introduction
Provenance, also referred to as lineage, is metadata that describes the history of a piece of data [in computer science]. It contains:
Links between results and their corresponding starting conditions or configuration parameters
Links between a piece of data and the applications that produced or consumed it, and many other attributes
It gives users the ability to point at a piece of data and ask: where did it come from?

5 / 23 Provenance: A Quick Introduction
OPM (the Open Provenance Model) is the de facto standard model for provenance data.
Provenance is normally organized as a graph with properties.
Queries on provenance are normally translated into graph operations.
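As a concrete illustration of "provenance as a graph with properties," the sketch below encodes a few OPM-style edges (OPM defines dependency labels such as "used" and "wasGeneratedBy") in plain Python. The node names are illustrative placeholders loosely modeled on the workflow discussed later in the talk, not an actual LPS data structure.

```python
# Minimal OPM-style provenance graph: processes and data artifacts are
# nodes; each edge carries an OPM dependency label. A process "used" an
# artifact; an artifact "wasGeneratedBy" a process.
graph = {
    "edges": [
        ("align_warp#1", "used", "anatomy_image_1"),
        ("warp_params_1", "wasGeneratedBy", "align_warp#1"),
        ("reslice#1", "used", "warp_params_1"),
        ("resliced_image_1", "wasGeneratedBy", "reslice#1"),
    ]
}

def dependencies_of(node, edges):
    """Return the immediate nodes that `node` depends on."""
    return [tgt for (src, _label, tgt) in edges if src == node]

print(dependencies_of("reslice#1", graph["edges"]))  # ['warp_params_1']
```

A provenance query then becomes a walk over these labeled edges, which is why the slide says queries translate into graph operations.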

6 / 23 W3C Provenance Incubator Group Use Cases
Result Differences
Anonymous Information
Information Quality Assessment for Linked Data
Timeliness
Simple Trustworthiness Assessment
Ignoring Unreliable Data
Answering user queries that require semantically annotated provenance
Provenance in Biomedicine
Closure of Experimental Metadata
Provenance Tracking in the Blogosphere
Provenance of a Tweet
Provenance and Private Data Use
Provenance of Decision Making in Emergency Response
Provenance of Collections vs Objects in Cultural Heritage
Provenance at different levels in Cultural Heritage
Locating Biospecimens With Sufficient Quality
Identifying attribution and associations
Determining Compliance with a License
Documenting axiom formulation
Evidence for public policy
Evidence for engineering design
Fulfilling Contractual Obligations
Attribution for a versioned document
Provenance for Environmental Marine Data
Crosswalk Maintenance
Metadata Merging
Mapping Digital Rights
Computer Assisted Research
Handling Scientific Measurement Anomaly
Human-Executed Processes
Semantic disambiguation of data provider identity
Hidden Bug
Using process provenance for assessing the quality of information products

7 / 23 An Example of Provenance: Blog Aggregators
It wants to determine the correct originator of an item
It wants to determine whether it can reuse an image
It wants to determine whether the image was modified
It wants to provide confidence to the end user

8 / 23 An Example of Provenance: Blog Aggregators

9 / 23 An Example of Provenance: Blog Aggregators

10 / 23 Motivation: Scientific Cases

11 / 23 Motivation: A Scientific Example
Assume a workflow 1 for creating population-based "brain atlases" from high-resolution fMRI anatomical data.
1. The inputs are a set of new brain images and a single reference brain image.
2. For each image, there is the actual image and its metadata.
3. The output is a graphical atlas image produced by a series of processing steps.
1. First Provenance Challenge,

12 / 23 The orange nodes are procedures; the blue nodes are data items. First Provenance Challenge,

13 / 23 Motivation: A Scientific Example
Understanding how a result is generated!
Assume thousands of images have been processed through this workflow (including intermediate results).
1. Find all the processes that led to a given Atlas X Graphic.
2. Find all invocations of procedure align_warp using parameter "-m 12" that ran on a Monday.
3. Find all graphic images output from workflows where at least one of the input Anatomy Headers had an entry maximum=4095.
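Query 1 above amounts to a reverse-reachability search over the provenance graph: starting from the Atlas X Graphic, follow "derived from" edges backwards and collect every process on the way. The sketch below is a self-contained illustration; the graph contents are placeholder nodes loosely based on the First Provenance Challenge workflow, not real captured data.

```python
# Provenance stored as: node -> list of nodes it was derived from or
# generated by. Process nodes are marked with "#<invocation>".
provenance = {
    "atlas_x_graphic": ["convert#1"],
    "convert#1": ["atlas_x_image"],
    "atlas_x_image": ["slicer#1"],
    "slicer#1": ["atlas_image"],
    "atlas_image": ["softmean#1"],
    "softmean#1": [],
}

def ancestors(node, graph):
    """All upstream nodes (processes and data) reachable from `node`."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

# "Find all the processes that led to a given Atlas X Graphic":
procs = {n for n in ancestors("atlas_x_graphic", provenance) if "#" in n}
print(sorted(procs))  # ['convert#1', 'slicer#1', 'softmean#1']
```

Queries 2 and 3 follow the same pattern, with an extra filter on node properties (parameters, timestamps, metadata entries) applied during the traversal.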

14 / 23 LPS: Lightweight Provenance System for HPC Platforms

15 / 23 Existing Provenance Systems
Provenance systems have been widely researched.
Figure from Lucian Carata et al., "A primer on provenance," Communications of the ACM, May 2014.

16 / 23 Existing Provenance Systems - Problems
Performance: Collecting provenance introduces overhead, but HPC is performance-sensitive. We expect less than 1% slowdown and a small memory footprint from any provenance system running on an HPC platform.
Granularity: HPC applications can be workflows, parallel jobs, or even single processes, and data can be shared and accessed concurrently. Finer-granularity provenance, such as which process accessed a file at what time, is needed.
Simplicity: No extra modification of applications or changes to users' behavior should be required.

17 / 23 LPS: Lightweight Provenance System (Architecture)

18 / 23 LPS: Local Provenance Collector
Operating system instrumentation: a SystemTap script probes critical system calls such as create/exit, open/close, and read/write, and reads the /proc file system.
Dynamic instrumentation: two SystemTap scripts, one capturing only basic system calls like open/close, the other also capturing frequent ones like read/write; end users can switch between them dynamically.
Advantages: transparent to users and applications; covers finer-granularity activities with advanced capture; low overhead with basic capture.
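To make the "basic capture" idea concrete, a SystemTap script along these lines would emit one provenance record per critical system call. This is only a sketch using SystemTap's standard syscall tapset probe points and context variables; the actual scripts used by LPS are not shown in the talk.

```systemtap
# Sketch of basic capture: one record per open/close/exit event.
probe syscall.open {
    printf("OPEN pid=%d exe=%s file=%s\n", pid(), execname(), filename)
}
probe syscall.close {
    printf("CLOSE pid=%d exe=%s fd=%d\n", pid(), execname(), fd)
}
probe syscall.exit {
    printf("EXIT pid=%d exe=%s\n", pid(), execname())
}
```

The advanced-capture script would add probes on frequent calls such as syscall.read and syscall.write, which is why it costs more and is left to the user to enable.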

19 / 23 LPS: Execution Provenance Aggregator
The execution provenance aggregator combines processes and jobs collected in different locations:
On login nodes, LPS collects job submission operations.
On compute nodes, LPS collects process executions.
LPS uses the environment variables of these processes to map entities together:
Each MPI process has a PBS_JOBID environment variable.
Each job submission command returns the same job ID as its return value.
Advantage: a unified interface; any distributed computing framework, such as Hadoop MapReduce, can be supported.
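The mapping step relies on the fact that a running process's environment is readable from /proc/&lt;pid&gt;/environ as NUL-separated KEY=VALUE pairs. The sketch below shows how a compute-node collector could pull PBS_JOBID out of that blob; the variable name is from the talk, while the function and the example blob are illustrative, with minimal error handling.

```python
# Map a process to its batch job by extracting PBS_JOBID from the raw
# contents of /proc/<pid>/environ (NUL-separated KEY=VALUE entries).
def job_id_from_environ(environ_bytes):
    for entry in environ_bytes.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"PBS_JOBID":
            return value.decode()
    return None  # process does not belong to a PBS job

# Example with a fabricated environ blob; a real one would come from
# open("/proc/%d/environ" % pid, "rb").read().
blob = b"PATH=/usr/bin\0PBS_JOBID=12345.torque-server\0USER=alice\0"
print(job_id_from_environ(blob))  # 12345.torque-server
```

Once every process record carries a job ID, processes on compute nodes can be joined with the submission record captured on the login node, giving the unified interface the slide describes.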

20 / 23 LPS: Data Access Provenance Aggregator
It determines the correct data dependencies:
Different processes of a job may access the same file.
Different jobs may access the same file.
Different users may access the same file.
Their order (causality) is critical, as it determines the data dependencies. LPS supports three different granularities of data dependency.
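The causality rule behind job-level dependencies can be sketched as: if job A wrote file F before job B read F, then B depends on A. The record layout below (time, job, op, file) is an assumption for illustration, not the LPS on-disk format.

```python
# Derive job-level data dependencies from time-ordered access records.
def job_dependencies(records):
    """records: iterable of (time, job, op, file), op in {'read', 'write'}.
    Returns (downstream_job, upstream_job) pairs."""
    last_writer = {}  # file -> job that most recently wrote it
    deps = set()
    for _time, job, op, f in sorted(records):
        if op == "read" and f in last_writer and last_writer[f] != job:
            deps.add((job, last_writer[f]))
        elif op == "write":
            last_writer[f] = job
    return deps

records = [
    (1, "jobA", "write", "out.dat"),
    (2, "jobB", "read",  "out.dat"),
    (3, "jobB", "write", "final.dat"),
]
print(job_dependencies(records))  # {('jobB', 'jobA')}
```

The same scan run at process or user granularity (keying on process ID or user instead of job) would give the other two dependency levels the slide mentions.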

21 / 23 LPS: Performance Evaluation

22 / 23 LPS: Overhead On HPCG Benchmark

23 / 23 LPS: Overhead on IOR Benchmark

24 / 23 LPS: Overhead on MDTest Benchmark

25 / 23 LPS: Overhead on MDTest Benchmark

26 / 23 LPS: Overhead With Extreme Scalable IO

27 / 23 Conclusion & Current Work

28 / 23 Conclusion & Current Work
Conclusion:
It is possible to transparently collect provenance in an HPC environment.
The overhead can be kept under 1% if some accuracy in data dependencies can be sacrificed.
Dynamically changing granularity is possible in an HPC environment to satisfy different users.
Current and future work:
The prototype is implemented.
A provenance query tool is under development.
Deployment on a production HPC cluster is planned.

29 / 23 Thanks & Questions
