Lecture 6: Data Versioning

Credit: Some slides by Chavan et al.

Today’s Lecture
1. Data Hubs
2. Dataset Versioning

1. Data Hubs

From old-school applications…

…to the vision of Ground

Accessibility: The key challenge

Core challenges
- A new architecture for storing large numbers of diverse datasets
- Managed on behalf of different users/groups, with sharing/collaboration capabilities
- Support for different formats
- Data movement
- Easy ingest, cleaning, and visualization
- Data versioning, lineage
- Infrastructure to host a large number of data-processing apps (ModelHub later!)

2. Dataset Versioning

A typical data analysis workflow

The dataset versioning hell
- Many private copies of the datasets lead to massive redundancy in storage
- No easy way to keep track of dependencies
- No mechanisms to support and record manual conflict resolution
- No way to analyze/compare/query versions (across users)

Dataset version control desiderata
- Branch, update, merge, transform
- Large unstructured or structured datasets
Main challenges:
- How can we store thousands of versions of datasets compactly?
- How can we access any version on demand, efficiently?

Version Control Systems
- We already have Git, SVN, and many more
- Their versioning algorithms are optimized for code-like data, where changes are:
  - Sparse
  - Local (focused in specific parts of the file)
- Scenario: What if we reformat a date that appears in all tuples of a structured dataset?
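The reformatting scenario can be tried out on a toy dataset. The sketch below uses Python's standard difflib on a made-up 100-row CSV; the row layout, the two date formats, and the helper name changed_lines are assumptions for illustration only. It counts how many lines a line-based diff has to record for a local edit versus a dataset-wide date reformat.

```python
import difflib

def changed_lines(old, new):
    """Return (removed, added) line counts from a unified line diff."""
    diff = difflib.unified_diff(old, new, lineterm="")
    # Drop the file headers and hunk markers, keep only content lines.
    body = [l for l in diff if not l.startswith(("---", "+++", "@@"))]
    removed = sum(l.startswith("-") for l in body)
    added = sum(l.startswith("+") for l in body)
    return removed, added

def reformat_date(row):
    """Rewrite the (hypothetical) YYYY-MM-DD date column as DD/MM/YYYY."""
    key, date, value = row.split(",")
    y, m, d = date.split("-")
    return f"{key},{d}/{m}/{y},{value}"

# Toy dataset: 100 tuples of the form id,date,value.
rows = [f"{i},2019-01-{i % 28 + 1:02d},{i * 1.5}" for i in range(100)]

# Code-like change: a single tuple is edited.
local_edit = list(rows)
local_edit[42] = "42,2019-01-15,0.0"

# Dataset-like change: every tuple gets its date reformatted.
reformatted = [reformat_date(r) for r in rows]

print(changed_lines(rows, local_edit))   # (1, 1)
print(changed_lines(rows, reformatted))  # (100, 100)
```

For the reformat, the line diff records every row twice, so the delta is larger than simply storing the new version.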

Version Control Systems in practice
- Even Git uses large amounts of RAM for large files!

It’s all about costs
- Storage cost is the space required to store a set of versions

It’s all about costs
- Recreation cost is the time required to access a version

How to recreate a version
- Use deltas: a delta between two versions is a file that allows constructing one version given the other
- A delta has its own storage cost and recreation cost
- Example delta ops: Unix diff, xdelta, XOR, etc.
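As a minimal illustration of one of the delta ops listed above, here is a sketch of an XOR delta in Python. The function names and the toy byte strings are assumptions, and the sketch only handles versions of equal length. Because XOR is its own inverse, the same delta recreates either version from the other, which makes it an undirected delta.

```python
def xor_delta(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length versions."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length versions")
    return bytes(x ^ y for x, y in zip(a, b))

def apply_xor(version: bytes, delta: bytes) -> bytes:
    """Recreate the other version; XOR undoes itself."""
    return xor_delta(version, delta)

# Two toy versions that differ in a single byte.
v1 = b"id,date,value\n" + b"1,2019-01-02,3.5\n" * 50
v2 = v1.replace(b"3.5\n", b"4.5\n", 1)

delta = xor_delta(v1, v2)
assert apply_xor(v1, delta) == v2   # forward direction
assert apply_xor(v2, delta) == v1   # same delta, backward direction

# The delta is zero everywhere except at the changed byte,
# so it compresses far better than a full copy of v2.
print(sum(1 for byte in delta if byte))  # 1
```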

Storage/Recreation Tradeoff (with delta encoding)

Problem
Given:
- A set of versions
- Partial information about deltas between versions, with
  - deltas being directed/undirected
  - deltas having different storage and recreation costs
PROBLEM: Find a storage solution that:
- Minimizes total recreation cost within a storage budget
- Minimizes maximum recreation cost within a storage budget
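To make the three quantities in the problem statement concrete, the sketch below evaluates one storage solution in Python. The representation is an assumption for illustration: a solution is a parent map (each version is stored as a delta from its parent), a hypothetical ROOT node stands for storing a version in full, and cost[(u, v)] holds the (storage, recreation) pair for the delta from u to v.

```python
ROOT = "ROOT"  # hypothetical dummy node: an arc from ROOT means "stored in full"

def solution_costs(parent, cost):
    """Return (total storage, total recreation, max recreation) of a solution.

    parent[v] is the version v is stored as a delta from (or ROOT).
    cost[(u, v)] is the (storage, recreation) pair of the delta u -> v.
    """
    storage = sum(cost[(p, v)][0] for v, p in parent.items())

    def recreation(v):
        # Recreating v means applying every delta on the path ROOT -> v.
        total = 0
        while v != ROOT:
            total += cost[(parent[v], v)][1]
            v = parent[v]
        return total

    per_version = {v: recreation(v) for v in parent}
    return storage, sum(per_version.values()), max(per_version.values())

# Toy numbers: A is stored in full, B as a delta from A, C as a delta from B.
cost = {(ROOT, "A"): (100, 100), ("A", "B"): (10, 10), ("B", "C"): (5, 5)}
chain = {"A": ROOT, "B": "A", "C": "B"}
print(solution_costs(chain, cost))  # (115, 325, 115)
```

Here C costs 115 to recreate because A must be read in full and two deltas applied on top of it.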

Multiple versions
- Let’s focus on directed deltas with identical storage and recreation costs

Baseline

Baseline Version 1: Minimize Storage Cost
- Recreation cost: no constraints
- Minimum Cost Arborescence (MCA): Given a digraph D = (V, E) and a root vertex r, an r-arborescence is a subset of arcs B ⊆ E such that for each vertex v ∈ V \ {r} there is a unique path from r to v in (V, B). Find an r-arborescence of minimum cost in D for a cost function c: E → ℝ.
- Edmonds’s algorithm: O(E + V log V)
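For intuition, here is a sketch of the contraction idea behind Edmonds's algorithm (also known as Chu-Liu/Edmonds) in Python. It is the simple O(VE) textbook form, not the O(E + V log V) implementation quoted above, and it returns only the cost of the arborescence, not its arcs. Vertices are assumed to be numbered 0..n-1; in the toy example vertex 0 plays the role of a root whose arcs store a version in full, which is an assumption for illustration.

```python
def min_arborescence_cost(n, root, edges):
    """Cost of a minimum-cost r-arborescence, or None if none exists.

    edges is a list of (u, v, w) arcs u -> v with cost w.
    """
    INF = float("inf")
    total = 0
    while True:
        # 1. Pick the cheapest incoming arc of every vertex.
        in_w = [INF] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        if any(v != root and in_w[v] == INF for v in range(n)):
            return None  # some vertex is unreachable from the root
        in_w[root] = 0

        # 2. Add the chosen arcs' cost and look for cycles among them.
        comp = [-1] * n     # id of the contracted component
        visited = [-1] * n  # start vertex of the walk that reached this vertex
        count = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while visited[x] != v and comp[x] == -1 and x != root:
                visited[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # The walk came back to x: every vertex on the cycle gets one id.
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # chosen arcs are acyclic, so they form the answer

        # 3. Contract each cycle into one vertex and reduce arc costs.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], count

# Versions 1..3 cost 10 each to store in full (arcs from 0); deltas cost 1.
arcs = [(0, 1, 10), (0, 2, 10), (0, 3, 10), (1, 2, 1), (2, 3, 1), (3, 1, 1)]
print(min_arborescence_cost(4, 0, arcs))  # 12
```

The cheapest solution stores one version in full and chains the other two off it through deltas, for a cost of 10 + 1 + 1.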

Baseline Version 2: Minimize Recreation Cost
- Storage cost: no constraints
- Shortest Path Tree (SPT)
- Dijkstra’s algorithm: O(E + V log V)
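A shortest path tree can be sketched with Dijkstra's algorithm as follows. The adjacency-list layout and the toy weights are assumptions for illustration. Note that Python's heapq is a binary heap, so this version runs in O((E + V) log V); the O(E + V log V) bound requires a Fibonacci heap.

```python
import heapq

def shortest_path_tree(adj, root):
    """Dijkstra from root; returns (dist, parent).

    adj[u] is a list of (v, w) arcs, where w is the recreation cost of u -> v.
    parent encodes the shortest path tree: each version is recreated from
    its parent along the cheapest path from the root.
    """
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

# Vertex 0 is the root; versions 2 and 3 are cheaper to reach through deltas.
adj = {0: [(1, 10), (2, 12), (3, 20)], 1: [(2, 1)], 2: [(3, 3)]}
dist, parent = shortest_path_tree(adj, 0)
print(dist)    # {0: 0, 1: 10, 2: 11, 3: 14}
print(parent)  # {0: None, 1: 0, 2: 1, 3: 2}
```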

Local Move Greedy (LMG) heuristic
GOAL: Minimize total recreation cost
- Start with the MCA
- Iterate until the storage budget is reached:
  - For each new delta, compute ρ = (reduction in total recreation cost) / (increase in storage cost)
  - Choose the delta with the highest ρ value
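The loop above can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the authors' implementation: it assumes ρ is the reduction in total recreation cost divided by the increase in storage cost, represents a solution as a parent map with a hypothetical ROOT node for versions stored in full, takes the candidate deltas as an explicit list, and recomputes all costs from scratch on every trial for clarity rather than speed.

```python
ROOT = "ROOT"  # hypothetical dummy node: an arc from ROOT means "stored in full"

def total_costs(parent, cost):
    """Return (total storage, total recreation) of a solution."""
    storage = sum(cost[(p, v)][0] for v, p in parent.items())
    recreation = 0
    for v in parent:
        node = v
        while node != ROOT:
            recreation += cost[(parent[node], node)][1]
            node = parent[node]
    return storage, recreation

def creates_cycle(parent, u, v):
    """True if v lies on the path ROOT -> u, so u -> v would close a cycle."""
    while u != ROOT:
        if u == v:
            return True
        u = parent[u]
    return False

def local_move_greedy(parent, cost, candidates, budget):
    """Greedily swap in candidate deltas (u, v) while the budget allows."""
    parent = dict(parent)
    while True:
        storage, recreation = total_costs(parent, cost)
        best, best_rho = None, 0.0
        for u, v in candidates:
            if parent[v] == u or creates_cycle(parent, u, v):
                continue
            trial = {**parent, v: u}  # store v via the delta u -> v instead
            new_storage, new_recreation = total_costs(trial, cost)
            gain = recreation - new_recreation
            extra = new_storage - storage
            if gain <= 0 or new_storage > budget:
                continue
            rho = gain / extra if extra > 0 else float("inf")
            if rho > best_rho:
                best, best_rho = (u, v), rho
        if best is None:
            return parent  # no affordable move improves recreation cost
        parent[best[1]] = best[0]

# Toy chain A <- B <- C <- D: full copies cost 100, deltas cost 10.
cost = {(ROOT, x): (100, 100) for x in "ABCD"}
cost.update({("A", "B"): (10, 10), ("B", "C"): (10, 10), ("C", "D"): (10, 10)})
mca = {"A": ROOT, "B": "A", "C": "B", "D": "C"}   # storage 130, recreation 460
candidates = [(ROOT, "B"), (ROOT, "C"), (ROOT, "D")]

result = local_move_greedy(mca, cost, candidates, budget=230)
print(result["C"], total_costs(result, cost))  # ROOT (220, 420)
```

In this toy run, storing C in full has the best ratio (40 / 90) because it shortens the chains of both C and D; after that move no further full copy fits in the budget.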

Local Move Greedy (LMG) heuristic

More heuristics
- Local Move Greedy (LMG)
- Modified Prim’s (MP): Incrementally build a tree by adapting Prim’s algorithm
- Light Approximate Shortest-path Tree (LAST*): Balance the minimum spanning tree and the shortest path tree

Evaluation
- A storage budget of 1.1x the MCA storage cost reduces total recreation cost by 1000x