Size-estimation framework with applications to transitive closure and reachability. Edith Cohen, AT&T Bell Labs, 1996. Presented by Maxim Kalaev.

Agenda  Intro & Motivation  Algorithm sketch  The estimation framework  Estimating reachability  Estimating neighborhood sizes

Introduction  Descendant counting problem: “Given a directed graph G, compute for each node the number of nodes reachable from it, and the total size of the transitive closure.”

Introduction  S(v) – the set of nodes reachable from node v.  Transitive closure size: T = Σ_{v∈V} |S(v)|.  Example (graph on nodes A, B, C, D, E shown on the slide): |S(‘A’)|=5, |S(‘B’)|=3, T = |S(‘A’)|+|S(‘B’)|+… = 15
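To make these quantities concrete, here is a minimal brute-force sketch (my own illustration, not part of the paper): it computes every |S(v)| exactly by a DFS from each node, which is precisely the expensive O(|V|·|E|) computation the framework is designed to avoid. The toy graph in the commented usage line is hypothetical.

from collections import defaultdict

def exact_descendant_counts(nodes, edges):
    # Build adjacency lists of the directed graph.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    sizes = {}
    for s in nodes:
        # DFS from s; S(s) includes s itself.
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes[s] = len(seen)
    return sizes, sum(sizes.values())   # per-node sizes |S(v)| and their sum T

# Hypothetical toy graph, just to show the interface:
# sizes, T = exact_descendant_counts(["A", "B", "C", "D", "E"],
#                                    [("A", "C"), ("A", "E"), ("E", "D"), ("D", "B")])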

Motivation  Applicable to DB-query size estimation  Data mining  Matrix multiplication optimizations  Parallel DFS algorithm optimizations

Framework algorithm sketch  Least-descendant mapping: given a graph G(V,E) with ranks on its nodes, compute a mapping LE from each node v in V to the least-ranked node in S(v).  Example (nodes with ranks A:4, B:1, C:5, D:2, E:3): LE(‘A’) = 1, LE(‘C’) = 2

Framework algorithm sketch  The LE (least element) rank is highly correlated with the size of S(v)!  The precision can be improved by running several iterations, each with a fresh random rank assignment and a recalculation of LE.

The estimation framework  Let X be a set of elements x with non-negative weights w(x).  Let Y be a set of labels y, and S a mapping from labels y to subsets S(y) ⊆ X.  Our objective is to compute an estimate of w(S(y)) = Σ_{x∈S(y)} w(x) for every y, assuming X, Y and the weights are given but it is costly to compute w(S(y)) exactly for all y’s.

The estimation framework  Assume we have the following LE (Least Element) oracle: given ranks R(x) on the elements of X, LE(y) returns the element with minimal rank in S(y), i.e. LE(y) = argmin_{x∈S(y)} R(x), in O(1) time.  The estimation algorithm performs k iterations, where k is determined by the required precision.

The estimation framework  Iteration: independently, for each x in X, select a random rank R(x) from an exponential distribution with parameter w(x); its distribution function is F(t) = 1 − e^(−w(x)·t), t ≥ 0. Apply the LE oracle to the selected ranking and store the obtained min-rank R(LE(y)) for each y in Y.
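As a small illustration (an assumed helper, not from the slides), one rank-assignment iteration can be implemented directly with Python's standard library, since random.expovariate(w) samples from the exponential distribution with rate w:

import random

def assign_exponential_ranks(weights):
    # weights: {x: w(x)} with w(x) > 0; returns {x: R(x)} with R(x) ~ Exp(w(x)).
    return {x: random.expovariate(w) for x, w in weights.items()}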

The estimation framework  Proposition: the distribution of the minimum rank R(LE(y)) depends only on w(S(y)).  Proof: the minimum of independent exponential r.v.’s with parameters λ_1,…,λ_m is exponentially distributed with parameter λ_1+…+λ_m; here the parameters are the weights w(x) for x ∈ S(y), so the minimum is exponential with parameter w(S(y)).  Our objective now is to estimate this distribution parameter from the given samples.
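For completeness, a one-line derivation of the fact used in the proof (a standard property of exponential random variables): Pr[min_i X_i > t] = Π_i Pr[X_i > t] = Π_i e^(−λ_i·t) = e^(−(Σ_i λ_i)·t), so min_i X_i is itself exponentially distributed with parameter Σ_i λ_i.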

The estimation framework  The mean of an exponentially distributed r.v. with parameter λ is 1/λ.  We can use this fact to estimate λ from samples by 1/(sample mean).  Using this to estimate w(S(y)) from the minimal ranks R_1,…,R_k obtained in the k iterations: ŵ(S(y)) = k / (R_1 + … + R_k).
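A sketch of this estimator (assumed helper): given the k minimum ranks observed for a label y, return one over their sample mean, i.e. k divided by their sum.

def estimate_weight(min_ranks):
    # min_ranks: the k values R(LE(y)) collected over the k iterations.
    k, total = len(min_ranks), sum(min_ranks)
    return k / total if total > 0 else float("inf")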

The estimation framework  More estimators: taking the k(1−1/e)-smallest of the k samples (a quantile estimator, analogous to taking the median for a uniform distribution), or using a non-intuitive averaging estimator (see the paper for its exact form).

The estimation framework  Complexity so far: allowing a tolerated relative error ε, only a small number of significant bits of the ranks R needs to be stored; the k rank-assignment iterations take O(k·|X|) time plus k·O(oracle setup time).  Asymptotic accuracy bounds follow from the concentration of the estimators (the proof is deferred).

Estimating reachability  Objective: given a graph G(V,E), estimate for each v the number of its descendants |S(v)| and the size of the transitive closure T.  All we need is an oracle that computes the LE mapping. The following algorithm takes an arbitrary ranking of the nodes (given in sorted order) and computes LE in O(|E|) time:

Estimating reachability  LE subroutine():
  Reverse the direction of all edges of the graph
  Iterate until V = {}:
    Pop the node v with minimal rank from V
    Run a DFS to find all nodes reachable from v in the reversed graph (call this set U)
    For each node in U set LE = v
    V = V \ U
    E = E \ {edges incident to nodes in U}
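A possible Python rendering of this subroutine (my sketch; it assumes the nodes are given already sorted by increasing rank and the edges as (u, v) pairs of the original graph). Instead of explicitly deleting U and its incident edges, it simply never revisits nodes that have already received an LE value, which has the same effect and keeps the total work at O(|V| + |E|).

from collections import defaultdict

def least_element_mapping(nodes_by_rank, edges):
    # rev[v] lists the nodes u with an original edge u -> v, so a DFS from v in rev
    # reaches exactly the nodes that can reach v in the original graph (v ∈ S(u)).
    rev = defaultdict(list)
    for u, v in edges:
        rev[v].append(u)
    le = {}
    for v in nodes_by_rank:          # increasing rank order
        if v in le:                  # already labelled by a lower-ranked node
            continue
        le[v] = v
        stack = [v]
        while stack:                 # DFS over still-unlabelled nodes only
            u = stack.pop()
            for w in rev[u]:
                if w not in le:
                    le[w] = v        # v is the least-ranked node in S(w)
                    stack.append(w)
    return le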

Estimating reachability  Each estimation iteration takes O(|V|) + O(|E|) time, assuming the node ranks can be sorted in expected linear time.  The accuracy bounds follow directly from the estimator bounds of the framework.
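Putting the pieces together, here is one way the whole reachability estimator could look (my glue code, not the paper's implementation; it reuses the least_element_mapping sketch above and unit weights, since we are counting nodes):

import random

def estimate_descendant_counts(nodes, edges, k=100):
    rank_sums = {v: 0.0 for v in nodes}
    for _ in range(k):
        # Fresh Exp(1) ranks in every iteration.
        ranks = {v: random.expovariate(1.0) for v in nodes}
        order = sorted(nodes, key=ranks.get)
        le = least_element_mapping(order, edges)
        for v in nodes:
            rank_sums[v] += ranks[le[v]]
    # k / (sum of minimum ranks) estimates |S(v)|; summing the estimates gives T.
    sizes = {v: k / rank_sums[v] for v in nodes}
    return sizes, sum(sizes.values())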

Estimating neighborhood sizes  Problem: given a graph G(V,E) with nonnegative edge lengths, estimate n(v,d), the number of nodes within distance at most d from node v.  Our algorithm preprocesses G and afterwards answers (v,d) queries quickly; both bounds are derived below.

Estimating neighborhood sizes  Example (graph on nodes with ranks A:4, B:1, C:5, D:2, E:3 and edge lengths shown on the slide):
  N(A,7)={A,B,C,D,E}, n(A,7)=5
  N(A,3)={A,C,E}, n(A,3)=3
  N(D,0)={D}, n(D,0)=1
  N(C,∞)={C}, n(C,∞)=1

Estimating neighborhood sizes  After preprocessing G, we generate for each node v a list of pairs ({d_1,s_1}, {d_2,s_2}, …, {d_η,s_η}), where the d’s are distances and the s’s are estimated neighborhood sizes; each list is sorted by the d’s.  To answer n(v,d) we look for the index i such that d_i ≤ d < d_{i+1} and return s_i.
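A sketch of the query step (assumed list layout: (d_i, s_i) pairs sorted by increasing distance): binary-search for the largest d_i ≤ d and return the matching size estimate.

import bisect

def query_neighborhood_size(pairs, d):
    # pairs: [(d_1, s_1), ..., (d_eta, s_eta)] sorted by distance.
    i = bisect.bisect_right([p[0] for p in pairs], d) - 1
    return pairs[i][1] if i >= 0 else 0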

Estimating neighborhood sizes  The algorithm runs k iterations; in each iteration it creates for each node of G a least-element (LE) list ({d_1,v_1}, {d_2,v_2}, …, {d_η,v_η}) such that for any neighborhood N(v,d) we can find a min-rank node using the list: the min-rank node in N(v,d) is v_i, where d_i is the largest listed distance with d_i ≤ d.

Estimating neighborhood sizes  Example (same graph, nodes with ranks A:4, B:1, C:5, D:2, E:3):
  Neighborhoods: N(A,7)={A,B,C,D,E}, N(A,3)={A,C,E}, N(D,1)={C,D}, N(C,∞)={C}
  LE-lists: A: ({A,0},{E,1},{D,2},{B,4}); B: ({B,0}); C: ({C,0}); D: ({D,0}); E: ({E,0},{D,3})
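A small sketch (assumed helper) of how a single LE-list answers "which is the min-rank node within distance d of v?": since the lists are short (O(log|V|) expected, as shown later), a linear scan for the last entry with distance ≤ d suffices.

def min_rank_within(le_list, d):
    # le_list: [(node, dist), ...] sorted by increasing dist (and decreasing rank).
    answer = None
    for node, dist in le_list:
        if dist <= d:
            answer = node   # later qualifying entries have strictly lower rank
        else:
            break
    return answer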

Estimating neighborhood sizes - alg  sub Make_le_lists():
  Assume the nodes v_1,…,v_n are sorted by rank in increasing order
  Reverse the edge directions of G
  For i = 1..n: initialize the LE-list of v_i to be empty
  For i = 1..n: run a modified Dijkstra’s algorithm from v_i, DO: (next slide)

Estimating neighborhood sizes - alg
  I. Start with an empty heap; place v_i on the heap with label 0
  II. Iterate until the heap is empty:
    Pop the node v_k with minimal label d from the heap
    Add the pair {d, v_i} to v_k’s LE-list
    For each out-edge (v_k, u) of v_k (in the reversed graph):
      If u is in the heap, update its label to min(label(u), d + w(v_k,u))
      Else, if d + w(v_k,u) is smaller than the smallest distance already on u’s LE-list, place u on the heap with label d + w(v_k,u)
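A Python sketch of this construction (my implementation, using the standard heapq module with lazy deletion instead of the Fibonacci heap discussed later; nodes are processed in increasing rank order, and edges of the original graph are given as (u, v, length) triples). Each node's list ends up ordered by increasing source rank, i.e. decreasing distance.

import heapq
from collections import defaultdict

def build_le_lists(nodes_by_rank, edges):
    rev = defaultdict(list)                       # reversed graph: Dijkstra from src gives dist(u, src) in G
    for u, v, length in edges:
        rev[v].append((u, length))
    le_lists = {v: [] for v in nodes_by_rank}     # per node: [(distance, source), ...]
    for src in nodes_by_rank:                     # increasing rank
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                          # stale heap entry
            le_lists[u].append((d, src))          # src is the min-rank node within distance d of u
            for w, length in rev[u]:
                nd = d + length
                # The key pruning: only relax w if nd also beats the smallest distance
                # already on w's LE-list (contributed by a lower-ranked source).
                best_prev = le_lists[w][-1][0] if le_lists[w] else float("inf")
                if nd < dist.get(w, float("inf")) and nd < best_prev:
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
    return le_lists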

Estimating neighborhood sizes - demo  (Animated step-by-step run of the algorithm on the example graph; the resulting LE-lists match the ones shown earlier: A: ({A,0},{E,1},{D,2},{B,4}), B: ({B,0}), C: ({C,0}), D: ({D,0}), E: ({E,0},{D,3}).)

Estimating neighborhood sizes - analysis  Correctness  Proposition 1: a node v is placed on the heap in iteration i if and only if the distance from v to v_i is strictly smaller than its distance to every lower-ranked node v_j, j < i.  If v is placed on the heap in iteration i, then the pair {dist(v,v_i), v_i} is placed on v’s LE-list, and the label d with which v is popped equals dist(v, v_i).

Estimating neighborhood sizes - analysis  Complexity  Proposition 2: if the ranking is a random permutation, the expected size of an LE-list is O(log|V|).  The proof is based on Proposition 1 and a divide-and-conquer style analysis.

Estimating neighborhood sizes - analysis (proof cont.)  Suppose the LE-list of node u contains x pairs, and consider the nodes ordered by their distance to u.  By Proposition 1, the i-th closest node contributes a pair to u’s list iff all nodes with lower rank are farther from u than it is, i.e. iff it has the minimum rank among the i closest nodes.  For a random rank permutation this happens with probability 1/i, so the expected list size is E[x] = Σ_{i=1}^{|V|} 1/i = O(log|V|).

Estimating neighborhood sizes - analysis  Complexity (cont.)  Running time: using Fibonacci heaps, a pop() takes O(log|V|) and an insert() or decrease-key takes O(1) amortized.  Let x_v be the number of iterations in which v was placed on the heap.  The running time is therefore Σ_{v∈V} x_v·(log|V| + deg(v)), where deg(v) is v’s degree in the reversed graph.  Since x_v is also the size of v’s LE-list, its expectation is O(log|V|), giving an expected total of O(|V|·log²|V| + |E|·log|V|).

Estimating neighborhood sizes  k-iterations issues  What do we do with the k LE-lists obtained per node? The naïve approach gives O(k·loglog|V|) query time (a binary search in each of the k lists).  This can be improved to O(log k + loglog|V|) by merging the lists and storing the sum of the ranks at each breakpoint.  The total setup time of the algorithm is k times the cost of one LE-list construction.

This page intentionally left blank.

Summary  A general size-estimation framework  Two applications – transitive closure size estimation and neighborhood size estimation

THE END!