Putting Lipstick on Pig: Enabling Database-styleWorkflow Provenance YaelAmsterdamer,Susan B.Davidson,Daniel Deutch Tova Milo,Julia Stoyanovich,ValTannen.

Slides:



Advertisements
Similar presentations
Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.
Advertisements

IPAW'08 – Salt Lake City, Utah, June 2008 Data lineage model for Taverna workflows with lightweight annotation requirements Paolo Missier, Khalid Belhajjame,
Open Provenance Model Tutorial Session 2: OPM Overview and Semantics Luc Moreau University of Southampton.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Querying Workflow Provenance Susan B. Davidson University of Pennsylvania Joint work with Zhuowei Bao, Xiaocheng Huang and Tova Milo.
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
SOFTWARE TESTING. INTRODUCTION  Software Testing is the process of executing a program or system with the intent of finding errors.  It involves any.
High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
Spark: Cluster Computing with Working Sets
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
Mapping Physical Formats to Logical Models to Extract Data and Metadata Tara Talbott IPAW ‘06.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.
Provenance-aware Storage Systems Kiran-Kumar Muniswamy-Reddy David A. Holland Uri Braun Margo Seltzer Harvard University.
Pig Acknowledgement: Modified slides from Duke University 04/13/10 Cloud Computing Lecture.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
I. Pribela, M. Ivanović Neum, Content Automated assessment Testovid system Test generator Module generators Conclusion.
Interpreting the data: Parallel analysis with Sawzall LIN Wenbin 25 Mar 2014.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
Restore : Reusing results of mapreduce jobs Jun Fan.
Querying Business Processes Under Models of Uncertainty Daniel Deutch, Tova Milo Tel-Aviv University ERP HR System eComm CRM Logistics Customer Bank Supplier.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
RESTORE IMPLEMENTATION as an extension to pig Vijay S.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
A Logic Programming Approach to Scientific Workflow Provenance Querying* Shiyong Lu Department of Computer Science Wayne State University, Detroit, MI.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –
Paolo Missier (1), Bertram Luda ̈ scher (2), Shawn Bowers (3), Saumen Dey (2), Anandarup Sarkar (3), Biva Shrestha (4), Ilkay Altintas (5), Manish Kumar.
CERN – European Organization for Nuclear Research Administrative Support - Internet Development Services CET and the quest for optimal implementation and.
Information & Communication Technology (ICT) Books: 1. Management Information Systems James A. O’Brien & George M. Marakas 2. Introduction Of Information.
Large Scale Nuclear Physics Calculations in a Workflow Environment and Data Provenance Capturing Fang Liu and Masha Sosonkina Scalable Computing Lab, USDOE.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Streaming XPath Engine Oleg Slezberg Amruta Joshi.
Recording Actor Provenance in Scientific Workflows Ian Wootten, Shrija Rajbhandari, Omer Rana Cardiff University, UK.
PDAC-10 Middleware Solutions for Data- Intensive (Scientific) Computing on Clouds Gagan Agrawal Ohio State University (Joint Work with Tekin Bicer, David.
A Scalable Service Architecture for Distributed Search Mark Jessop University of York.
VisTrails Second Provenance Challenge Tommy Ellkvist David Koop Juliana Freire Joint work with: Erik Andersen, Steven P. Callahan, Emanuele Santos, Carlos.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Compilation of XSLT into Dataflow Graphs for Web Service Composition Peter Kelly Paul Coddington Andrew Wendelborn.
1 Storing and Maintaining Semistructured Data Efficiently in an Object- Relational Database Mo Yuanying and Ling Tok Wang.
Safety Guarantee of Continuous Join Queries over Punctuated Data Streams Hua-Gang Li *, Songting Chen, Junichi Tatemura Divykant Agrawal, K. Selcuk Candan.
A Semi-Canonical Form for Sequential Circuits Alan Mishchenko Niklas Een Robert Brayton UC Berkeley Michael Case Pankaj Chauhan Nikhil Sharma Calypto Design.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.
Spring Staff Lecturer: Prof. Sara Cohen Graders: Igor Lifshits, Arbel Moshe 2.
SOFTWARE TESTING LECTURE 9. OBSERVATIONS ABOUT TESTING “ Testing is the process of executing a program with the intention of finding errors. ” – Myers.
1 / 23 Presenter: Dong Dai, DISCL Lab. TTU Data-Intensive Scalable Computing Laboratory Department of Computer Science Accelerating Scientific.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Software Testing.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
MSBIC Hadoop Series Processing Data with Pig
Pig Latin - A Not-So-Foreign Language for Data Processing
UNIT-4 BLACKBOX AND WHITEBOX TESTING
Pig Latin: A Not-So-Foreign Language for Data Processing
The Idea of Pig Or Pig Concepts
CSE 491/891 Lecture 21 (Pig).
Jason Flinn Michael Chow, David Devecsery, Xianzheng Dou, Andrew Quinn
UNIT-4 BLACKBOX AND WHITEBOX TESTING
Software Testing.
Presentation transcript:

Putting Lipstick on Pig: Enabling Database-styleWorkflow Provenance YaelAmsterdamer,Susan B.Davidson,Daniel Deutch Tova Milo,Julia Stoyanovich,ValTannen Using slides by Guozhang Wang

Data-Intensive complex computational processes often generate many final and intermediate data products Scientists and engineers need to expend substantial effort managing data and recording provenance information so that basic questions can be answered. Provenance in context of Workflows

Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? Provenance in context of Workflows

an essential component to allow for result reproducibility verifiability sharing and knowledge re-use in the scientific community Provenance in context of Workflows

Workflow Provenance Motivated by ScientificWorkflows ◦ Community :IPAW ◦ Interests:process documentation,data derivation and annotation,etc ◦ Model :OPM

OPM Model Annotated directed acyclic graph ◦ Artifact:immutable piece of state ◦ Process:actions performed on artifacts,result in new artifacts ◦ Agents:execute and control processes Aims to capture causal dependencies between agents/processes Each process is treated as a“black-box”

Example:Car Dealership

Data Provenance (for Relational DB and XML) Motivated by Prob.DB,data warehousing.. ◦ Community: SIGMOD/PODS ◦ Interests:data auditing,data sharing, etc ◦ Model:Semiring (etc)

Semiring K-relations ◦ Each tuple is uniquely labeled with a provenance“token” Operations: ◦ :join ◦ + :projection ◦ 0 and 1:selection predicates

Data Provenance Researchers Workflow Provenance Researchers

Semiring Comes to Meet OPM

OPM’s Drawbacks in Semiring People’s Eyes The black-box assumption:each output of the module depends solely on all its inputs ◦ Cannot leverage the common fact that some output only depends on small subset of inputs ◦ Does not capture internal state of a module So:replace it with Semirings!

The Idea General workflow modules are complicated,and thus hard to capture its internal logic by annotations However,modules written in Pig Latin is very similar to Nested Relational Calculus (NRC),thus are much more feasible

Pig Latin Data:unordered (nested) bag of tuples Operators: ◦ FOREACH t GENERATE f1, f2, … OP(f0) ◦ FILTER BY condition ◦ GROUP/COGROUP ◦ UNION, JOIN, FLATTEN, DISTINCT …

Example:Car Dealership

Bid Request Handling in Pig Latin

ProvenanceAnnotation

ProvenanceAnnotation 1.1 Provenance node and value nodes ◦ Workflow input nodes ◦ Module invocation nodes ◦ Module input/output nodes

ProvenanceAnnotation I.2 State nodes ◦ P-node for the tuple ◦ P-node for the state

ProvenanceAnnotation 2.1 FOREACH (projection,no OP) ◦ P-node with“+”

ProvenanceAnnotation 2.2 JOIN ◦ P-node with“*”

ProvenanceAnnotation 2.3 GROUP ◦ P-node with“∂”

ProvenanceAnnotation 2.4 FOREACH (aggregation,OP) ◦ V-node with the OP name

ProvenanceAnnotation 2.5 COGROUP ◦ P-node with“∂”

ProvenanceAnnotation 2.6 FOREACH (UDF Black Box) ◦ P-node/V-node with the UDF name

Query Provenance Graph Zoom-In v.s.Zoom-Out Coarse-grained Fine-grained

Query Provenance Graph Deletion Propagation ◦ Delete the tuple P-node and its out-edges ◦ Repeated delete P-nodes if  All its in-edges are deleted  It has label and one of its in-edges is deleted

Implementation and Experiments Lipstick prototype ◦ Provenance annotation coded in Pig Latin, with the graph written to files ◦ Query processing coded in Java and runs in memory. Benchmark data ◦ Car dealership:fixed workflow and # dealers ◦ Arctic Station:Varied workflow structure and size

Annotation Overhead Overhead increases with execution time

Annotation Overhead Parallelism helps with up to # modules

Loading Graph Overhead Increase with graph size (comp.time < 8 sec)

Loading Graph Overhead Feasible with various sizes (comp.time < 3 sec)

Subgraph QueryTime Query efficiently with sub-second time

Conclusions Studied fine-grained provenance for workflows Individual modules implemented in Pig Latin Provenance model for Pig Latin queries Data provenance ideas such as Semirings can be brought to workflow provenance Thank You!