Optimizing User Views for Workflows Sudeepa Roy (with Olivier Biton, Susan Davidson and Sanjeev Khanna) ZOOM Project, Database Research Group University.

Presentation transcript:

Optimizing User Views for Workflows
Sudeepa Roy (with Olivier Biton, Susan Davidson and Sanjeev Khanna)
ZOOM Project, Database Research Group, University of Pennsylvania

Workflow
A workflow is a graphical representation of a sequence of actions to perform a task (e.g. a biological experiment).
Vertex: a module (program). It takes a set of data items as input and produces a set of data items as output.
Edge: control (and data) flow. Data is typically a file.
A workflow has a start (s) and an end (t) module.
Run: an execution of the workflow. Actual data appears on the edges. A module can be executed once the data on each of its incoming edges has been computed.
[Figure: example workflow with modules Start (s), Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees and end (t); the edges of a run carry sequence data.]

Data Provenance in Scientific Workflows
High-throughput technologies generate huge amounts of data, which must be analyzed in computational experiments.
The analysis may be complex and multi-step.
Scientific workflow systems are frequently used to help conceptualize and manage the analysis process as well as intermediate and final data products.
There is an increasing need to record the provenance (i.e. the origin or history) of data products, defined as a depends-on relationship between module executions and other data products.
Many scientific workflow systems (e.g. Vistrails, Kepler, Taverna) now support provenance.

Need for Provenance
Which sequences have been used to produce this tree?
How has this tree been generated?
Can I throw away some of these data? Which ones are really important to keep?
[Figure: a biologist's workspace full of sequence files, and the bioinformatics protocols applied to them (Alignments, ClustalW, PAUPS, Phillips, Bootstrap, ...), next to the workflow s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t.]

Provenance Overload
[Figure: a workflow specification (s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format-1, Format-2, Format-3, Construct Trees, t) and a workflow run whose edges carry data items d1 ... d460. The immediate provenance and the deep provenance of the output of Construct Trees are highlighted.]
Can we reduce the amount of provenance shown to the user?

Relevant Modules and Composition
[BCD+08] shows how to focus user attention on the relevant portion of the provenance information.
The user specifies relevant modules.
The system creates composite modules (clusters).
The result is called a user-view.
[Figure: the full workflow specification and a user-view of it in which s, Align Sequences, Construct Trees and t are the relevant modules.]

User-view Reduces Provenance Information
[Figure: the user-view with composite modules M1, M2, M3 around s, Align Sequences and Construct Trees; only the data items d201 ... d301, d456, d458, d459, d460 remain visible.]
What properties should a good user-view have?
Problem: can the number of clusters be minimized in a good user-view?

Outline
Model and Definitions: Workflow Specification, User-View, Good user-view, Series-Parallel Graphs
Results for Series-Parallel graphs: Algorithm SP-View, Correctness, Upper bound, Lower bound, Optimality
Results for General graphs

Workflow Specification
Workflow specification: (G, s, t, R)
G(V, E) is a directed graph.
There is a unique start module (source) s and a unique finish module (sink) t.
R is the set of relevant modules (R-nodes).
NR = V - R is the set of non-relevant modules (NR-nodes).
s, t ∈ R
|V| = n, |E| = m, |R| = k
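The definition above can be sketched as a small data structure. This is only an illustration under assumptions: the class name `WorkflowSpec`, its field names and the example graph are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSpec:
    """Hypothetical container for a workflow specification (G, s, t, R)."""
    nodes: frozenset      # V
    edges: frozenset      # E, stored as (u, v) pairs
    source: str           # s
    sink: str             # t
    relevant: frozenset   # R

    def __post_init__(self):
        # The slides require s, t in R.
        if not {self.source, self.sink} <= self.relevant:
            raise ValueError("s and t must belong to R")
        if not self.relevant <= self.nodes:
            raise ValueError("R must be a subset of V")
        for u, v in self.edges:
            if u not in self.nodes or v not in self.nodes:
                raise ValueError(f"edge ({u}, {v}) uses a node outside V")

    @property
    def non_relevant(self):
        # NR = V - R
        return self.nodes - self.relevant


# Made-up example: a chain s -> a -> r1 -> b -> t with R = {s, r1, t}.
spec = WorkflowSpec(
    nodes=frozenset({"s", "a", "r1", "b", "t"}),
    edges=frozenset({("s", "a"), ("a", "r1"), ("r1", "b"), ("b", "t")}),
    source="s",
    sink="t",
    relevant=frozenset({"s", "r1", "t"}),
)
n, m, k = len(spec.nodes), len(spec.edges), len(spec.relevant)
```

Here n = 5, m = 4 and k = 3, and the non-relevant modules are a and b.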

User View
H: user-view of (G, s, t, R)
H is a directed graph whose nodes are clusters (composite modules) of nodes in G.
The nodes of H form a partition of the nodes in G.
An edge e = (u, v) in G survives in H as an edge e' if the end points u, v belong to different clusters in H.
The edge e in G induces the edge e' in H, or equivalently e is an origin of e'.
R-cluster: contains at least one R-node.
NR-cluster: contains only NR-nodes.
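A minimal sketch of how a partition of the nodes yields a user-view, assuming clusters are given as a dictionary from a cluster name to its member set. The function name `build_user_view` and the example are hypothetical.

```python
def build_user_view(edges, clusters, relevant):
    """Return (origins, kinds) for a user-view.

    origins maps each surviving edge (C, C') of H to the set of edges of G
    that induce it; kinds labels every cluster as "R" or "NR".
    All names here are illustrative, not taken from the slides.
    """
    cluster_of = {}
    for name, members in clusters.items():
        for node in members:
            if node in cluster_of:
                raise ValueError(f"{node} is in two clusters, so this is not a partition")
            cluster_of[node] = name

    origins = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:  # the edge survives only across different clusters
            origins.setdefault((cu, cv), set()).add((u, v))

    kinds = {name: ("R" if members & relevant else "NR")
             for name, members in clusters.items()}
    return origins, kinds


# Made-up example: a diamond s -> {a, b} -> t with a and b merged.
edges = {("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")}
clusters = {"S": {"s"}, "M": {"a", "b"}, "T": {"t"}}
origins, kinds = build_user_view(edges, clusters, relevant={"s", "t"})
```

In this example the view edge (S, M) has two origins, (s, a) and (s, b), and M is the only NR-cluster.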

Good and Bad User Views
Direct dependencies between relevant clusters should be preserved. They are defined in terms of elementary paths: an elementary path is a path where all the intermediate nodes are NR-nodes.
At most one R-node in each cluster: the R-cluster assumes the meaning of its R-node.
[Figure: a specification with R-nodes r1, r2, r3, r4, followed by Bad view-1, Bad view-2, Good view-1 and Good view-2.]

Three Properties of a Good User-view
Property 1 (well-formed): each cluster in H should contain at most one R-node from G.
[Figure: G (specification) and H (user-view) over the R-nodes r1, r2, r3, r4.]
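Property 1 is a purely local test on each cluster. A minimal sketch, assuming the same hypothetical cluster dictionary format (cluster name to member set):

```python
def is_well_formed(clusters, relevant):
    """True if every cluster contains at most one R-node (Property 1)."""
    return all(len(members & relevant) <= 1 for members in clusters.values())


# Made-up example with R = {r1, r2}.
relevant = {"r1", "r2"}
ok_view = {"C1": {"r1", "a"}, "C2": {"r2"}, "C3": {"b"}}
bad_view = {"C1": {"r1", "r2"}, "C2": {"a", "b"}}
```

The first view passes because each cluster holds at most one R-node; the second fails because C1 holds both r1 and r2.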

Three Properties of a Good User-view
Property 2 (soundness): every edge on an elementary path between two R-clusters in H should have all its origins on an elementary path between the corresponding R-nodes in G.
[Figure: G (specification) and H (user-view) over r1, r2, r3 and a data item d.]
Not sound: r2 did not depend on d in G, but depends on it in H.

Three Properties of a Good User-view
Property 3 (completeness): every edge on an elementary path between two R-nodes in G should induce an edge on an elementary path between the corresponding R-clusters in H.
[Figure: specification and user-view over r1, r2, r3 and a data item d.]
Not complete: d, produced by r1, was directly consumed by r3 in G, but is processed by r2 in H.
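The intent behind soundness and completeness, that direct dependencies between relevant modules are preserved, can be illustrated with a coarse check. The sketch below is an assumption-laden simplification: it compares only the pairs of R-nodes joined by an elementary path in G and in H, which is weaker than the edge-level Properties 2 and 3. The function names are hypothetical, and the view is assumed to be well-formed.

```python
def direct_dependencies(edges, relevant):
    """Pairs (r, r2) of relevant nodes joined by an elementary path,
    i.e. a path whose intermediate nodes are all non-relevant."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    deps = set()
    for r in relevant:
        stack, seen = list(succ.get(r, ())), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in relevant:
                deps.add((r, node))  # stop at the first relevant node
            else:
                stack.extend(succ.get(node, ()))
    return deps


def view_preserves_dependencies(edges, relevant, clusters):
    """Coarse check: same R-to-R direct dependencies in G and in the view."""
    cluster_of = {n: c for c, members in clusters.items() for n in members}
    view_edges = {(cluster_of[u], cluster_of[v])
                  for u, v in edges if cluster_of[u] != cluster_of[v]}
    r_node_of = {cluster_of[r]: r for r in relevant}  # assumes well-formed view
    view_deps = direct_dependencies(view_edges, set(r_node_of))
    mapped = {(r_node_of[a], r_node_of[b]) for a, b in view_deps}
    return mapped == direct_dependencies(edges, relevant)


# Made-up chain s -> a -> r -> b -> t with R = {s, r, t}.
edges = {("s", "a"), ("a", "r"), ("r", "b"), ("b", "t")}
relevant = {"s", "r", "t"}
good = {"S": {"s"}, "A": {"a", "r"}, "B": {"b"}, "T": {"t"}}
bad = {"S": {"s"}, "AB": {"a", "b"}, "R": {"r"}, "T": {"t"}}
```

Merging a and b into one NR-cluster creates a path from s to t in the view that bypasses r, so the bad view reports a dependency of t on s that does not exist in G.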

Optimization Problem
Given a directed graph G(V, E), a source s, a sink t and a set R of R-nodes (s, t ∈ R), |R| = k, find in polynomial time a good user-view H that minimizes the total number of clusters (an optimum user-view).

Questions
Can we find an optimum user-view in general directed graphs? Is this problem NP-complete?
Answer: unknown. [BCD+08] gives a poly-time algorithm to find a minimal good user-view, which may not be of minimum size.
What about special directed graphs that capture many common workflows?
Answer: optimum clustering for series-parallel graphs.
Can we find matching upper and lower bounds on the #clusters in terms of k (= |R|) and not n (= |V|)? In general graphs? In some special graphs?
Answer: tight bounds for general and series-parallel graphs.

Series-Parallel Graphs
Base case: an edge.
Series composition of G1 and G2.
Parallel composition of G1 and G2.
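The recursive definition can be sketched with the usual two-terminal conventions, which are assumed here rather than read off the slide: series composition identifies the sink of G1 with the source of G2, and parallel composition identifies the two sources and the two sinks. All names are hypothetical, and internal node names of the two operands are assumed to be disjoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTerminalGraph:
    edges: tuple   # (u, v) pairs; a tuple so that parallel edges are kept
    source: str
    sink: str


def base_edge(u, v):
    """Base case: a single edge from u to v."""
    return TwoTerminalGraph(((u, v),), u, v)


def _rename(g, old, new):
    """Rename node `old` to `new` everywhere in g."""
    fix = lambda x: new if x == old else x
    return TwoTerminalGraph(tuple((fix(u), fix(v)) for u, v in g.edges),
                            fix(g.source), fix(g.sink))


def series(g1, g2):
    """Glue the sink of g1 to the source of g2."""
    g2 = _rename(g2, g2.source, g1.sink)
    return TwoTerminalGraph(g1.edges + g2.edges, g1.source, g2.sink)


def parallel(g1, g2):
    """Glue the two sources together and the two sinks together."""
    g2 = _rename(_rename(g2, g2.source, g1.source), g2.sink, g1.sink)
    return TwoTerminalGraph(g1.edges + g2.edges, g1.source, g1.sink)


# A diamond built from four base edges.
upper = series(base_edge("s", "a"), base_edge("a", "t"))
lower = series(base_edge("s", "b"), base_edge("b", "t"))
diamond = parallel(upper, lower)
```

The diamond has source s, sink t and the four edges (s, a), (a, t), (s, b), (b, t).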

Examples: (Non)Series-Parallel Graphs
Characterization of two-terminal SP graphs [VTL79]: a two-terminal DAG is an SP graph if and only if it does not contain a subgraph homeomorphic to the forbidden subgraph shown on the slide.
[Figure: the forbidden subgraph, examples of SP graphs, and examples of non-SP graphs.]
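Besides the forbidden-subgraph test, a two-terminal graph can be checked by the standard reduction method: repeatedly collapse parallel edges and contract internal nodes with exactly one incoming and one outgoing edge, and accept if a single edge from s to t remains. The sketch below is a simple quadratic version of that idea, not the linear-time recognition algorithm, and it assumes every node lies on some path from s to t. The function name and the test graphs are hypothetical.

```python
from collections import Counter


def is_series_parallel(edges, s, t):
    """Reduction-based test for a two-terminal DAG with source s and sink t."""
    multi = Counter(edges)  # edge -> multiplicity
    changed = True
    while changed:
        changed = False
        # Parallel reduction: keep one copy of every repeated edge.
        for e in multi:
            if multi[e] > 1:
                multi[e] = 1
                changed = True
        # Series reduction: contract one node with in-degree 1 and out-degree 1.
        indeg, outdeg, pred, succ = Counter(), Counter(), {}, {}
        for u, v in multi:
            outdeg[u] += 1
            indeg[v] += 1
            succ[u] = v
            pred[v] = u
        for x in set(indeg) | set(outdeg):
            if x not in (s, t) and indeg[x] == 1 and outdeg[x] == 1:
                u, w = pred[x], succ[x]
                del multi[(u, x)]
                del multi[(x, w)]
                multi[(u, w)] += 1
                changed = True
                break
    return set(multi) == {(s, t)}


diamond = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
# A graph with a cross edge a -> b between the two branches.
bridge = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
```

The diamond reduces to a single edge, while the graph with the cross edge admits no reduction at all and is rejected.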

Series-Parallel Graphs (SP-graphs)
SP graphs are the workflow equivalent of structured programming (without iteration).
Many workflows encountered in practice are SP graphs and do not allow looping.
[Figure: the example workflow s, Split Entries, Align Sequences, Functional Data, Curate Annotations, Format, Construct Trees, t is an SP graph.]

Contributions

                 Optimum clustering             Upper bound on #clusters   Lower bound on #clusters
SP graphs        YES (by an O(n)-time algorithm)   2k - 3                     2k - 3
General graphs   ?                                 (2^{k-1} - k)^2 + k        (2^{k-1} - k)^2 + k

The upper bound for general graphs is obtained by analyzing the #clusters output by [BCD+08].
Moreover, we express global conditions for a good user-view in terms of local conditions for each cluster, for general graphs. The bounds are useful when k << n.

Algorithm SP-View
Forward pass
Process the vertices in a topological order.
  If an R-node: do nothing.
  If an NR-node:
    if single R-predecessor: merge
    if >= 1 NR-predecessor: merge with last predecessor
    else: do nothing
Produce an intermediate clustering.

Algorithm SP-View
Reverse pass
Take the intermediate clustering produced by the forward pass as input.
Produce a reverse topological order on the clusters.
Perform a procedure symmetric to the forward pass on the clusters.
[Figure: the example graph from s to t with clusters labelled C1 ... C13.]
Reduces 16 modules to 10 clusters. Cannot do better than 10 (k = 9)!
O(m+n) = O(n) time
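Both passes start from a topological order (forward) or its reverse. The sketch below covers only that traversal order, using Kahn's algorithm in O(m+n) time; the merge rules of SP-View are deliberately not implemented, since the slides do not spell them out in enough detail. The function name and the example graph are hypothetical.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm: return the nodes of a DAG in a topological order."""
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("the graph has a cycle, so no topological order exists")
    return order


nodes = ["s", "a", "b", "c", "t"]
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
forward = topological_order(nodes, edges)   # order for the forward pass
backward = list(reversed(forward))          # order for the reverse pass
```

Every edge (u, v) has u before v in `forward` and v before u in `backward`.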

Correctness
Proved by induction on each intermediate step of cluster formation.
Base case: any workflow specification is itself a good user-view.
In each step, we preserve two invariants:
  the SP-property holds
  we have a good user-view
The proof uses equivalent local conditions for clusters and the forbidden subgraph characterization of two-terminal SP graphs [VTL79].

Upper Bound
#clusters ≤ 2k - 3
Here we show a weaker bound: 2k - 1.
Each surviving NR-cluster has at least one unique R-predecessor as a witness.
t is no one's predecessor, so at most k - 1 NR-clusters survive.
#clusters ≤ k + (k - 1) = 2k - 1

Lower Bound
[Figure: an SP graph with R-nodes r_0, r_1, ..., r_{k-1}, where s = r_0 and t = r_{k-1}, and NR-nodes p_1, p_2, ..., p_{k-3}.]
#nodes = k + (k - 3) = 2k - 3
No two nodes can be merged in any good user-view.
Optimum #clusters = 2k - 3

Optimality
Outline of the steps:
Suppose SP-View outputs N1 R-clusters and N2 NR-clusters, so the total #clusters = N1 + N2.
N1 = k and cannot be reduced.
Each NR-cluster contains one essential NR-node that cannot be included in any R-cluster.
If two essential NR-nodes are put in different clusters by SP-View, no good user-view can put them in the same cluster.
Hence any good user-view has at least N2 NR-clusters.

Other Results (General Graphs)
Upper bound on the number of clusters: we show that the algorithm in [BCD+08] produces at most (2^{k-1} - k)^2 + k clusters. This is independent of the total number of nodes n.
Tight lower bound: we show that there exists a graph that needs (2^{k-1} - k)^2 + k clusters in any good user-view.
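To get a feel for the gap between the two tight bounds, the formulas can simply be evaluated. The function names are hypothetical, and the sketch assumes k >= 3, since the series-parallel lower-bound construction uses k - 3 NR-nodes.

```python
def sp_bound(k):
    """Tight bound on #clusters for series-parallel graphs: 2k - 3."""
    if k < 3:
        raise ValueError("this sketch assumes k >= 3")
    return 2 * k - 3


def general_bound(k):
    """Tight bound on #clusters for general graphs: (2^(k-1) - k)^2 + k."""
    return (2 ** (k - 1) - k) ** 2 + k


# Compare the bounds for a few values of k = |R|.
table = [(k, sp_bound(k), general_bound(k)) for k in (3, 4, 5, 10)]
```

For k = 3 the bounds are 3 and 4; for k = 10 they are 17 and 252014. Both depend only on k, never on n.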

Open Problems
Can we solve the optimization problem on general directed graphs? Is it NP-complete?
Can we get a constant-factor approximation to the optimum solution?
Can we extend our algorithm to handle a larger class of directed graphs?

Thank You