Massive Data Streams in Graph Theory and Computational Geometry Ph.D. Dissertation Defense Jian Zhang Advisor: Joan Feigenbaum Committee: Ravi Kannan Avi.

Presentation transcript:

Massive Data Streams in Graph Theory and Computational Geometry Ph.D. Dissertation Defense Jian Zhang Advisor: Joan Feigenbaum Committee: Ravi Kannan Avi Silberschatz Sampath Kannan (UPenn) Support: NSF grants and

June 15, 2005. J. Zhang, Ph.D. Dissertation Defense. Talk Outline: Streaming computational model; Overview of results; Approximate graph distances in the streaming model; Future research directions.

Data Streams. A data stream is a sequence of data elements: a_1, a_2, …, a_n. Examples: a stream of stock prices; a stream of IP packets. Data elements have different forms in different applications, such as a scalar value or a tuple. The semantics of the data elements also differ across applications.

Streaming Computational Model. Sequential access to the input stream. The order of data elements in the stream is not controlled by the algorithm and may be adversarial. Algorithms may perform pre- or post-processing without access to the data stream. [Figure: the stream passing through a small working space.]

Features of Streaming Algorithms. Small working space compared to the stream length n (for example, polylog n or another sublinear function of n). Small number of passes over the stream: one pass, or a constant number of passes. Fast per-data-element processing time.

Sliding-Window Model. A variation of streaming. The data stream is a time series and may be infinite. Consider the n most recent data elements. As time progresses, new data elements arrive and old data elements expire. The deletion of old data elements is implicit.
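
To make the idea of implicit deletion concrete, here is a minimal sketch that maintains the maximum of the n most recent elements. It is an illustration only and not an algorithm from the talk: the function name `sliding_window_max` is hypothetical, and in the worst case the deque holds n entries, so it does not meet the small-space goal of the model.

```python
from collections import deque


def sliding_window_max(stream, n):
    """Yield the maximum of the n most recent elements after each arrival."""
    window = deque()  # (arrival index, value) pairs, values strictly decreasing
    for i, x in enumerate(stream):
        # Implicit deletion: the element that arrived n steps ago expires now,
        # even though the stream never announces its removal.
        if window and window[0][0] <= i - n:
            window.popleft()
        # Older elements no larger than x can never be the maximum again.
        while window and window[-1][1] <= x:
            window.pop()
        window.append((i, x))
        yield window[0][1]


maxima = list(sliding_window_max([3, 1, 4, 1, 5, 9, 2, 6], 3))
```

The expiry test depends only on arrival indices, which is what "implicit" deletion amounts to here.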

Why Streaming? Data streams occur in real systems, for example IP-traffic flows. We need to distinguish the working space from the data storage. Storage devices: large capacity but slow access. Working space: small capacity but fast random access. We want to restrict random access to the mass storage but still see every element of the input set at least once.

Earlier Work on Streaming. Despite the restrictions of the model, a lot can be done, e.g., L_p norms [FKSV02, Indyk00], histograms [GKS01], and clustering [GMMO00]. Much of the work focuses on computing statistics. Often the working-space size is restricted to polylog space.

Talk Outline: Streaming computational model; Overview of results; Approximate graph distances in the streaming model; Future research directions.

Dissertation Contributions. Investigate important problem domains: computational geometry problems and graph problems. Show the importance of a more relaxed model: sublinear space instead of polylog space, and multiple passes. There are problems that are provably hard in the restricted model but feasible in the more relaxed model.

Results on Geometric Problems (1). Exact computation is hard using sublinear space: computing the exact Diameter, Closest Pair, or Convex Hull requires Ω(n) bits of space, where n is the number of points in the stream. Approximation is feasible: we give a one-pass, ε-approximation streaming algorithm for diameter. The algorithm needs storage for O(1/ε) points and processes each point in O(log(1/ε)) time. [Feigenbaum-S. Kannan-Zhang]
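
The following is a hedged sketch of one simple way to approximate the diameter of a planar point stream in one pass. It is not necessarily the algorithm of the cited paper: it swaps in the standard directional-extremes technique, and the class name `DirectionalDiameter` and the parameter `k` are illustrative assumptions. With k directions it stores at most 2k points and returns a value between cos(π/(2k))·D and D, where D is the true diameter. It spends O(k) time per point, not the O(log(1/ε)) stated on the slide.

```python
import math


class DirectionalDiameter:
    """One-pass diameter estimate for 2D points using k fixed directions."""

    def __init__(self, k):
        # k unit vectors evenly spaced over the half-circle [0, pi).
        self.dirs = [(math.cos(math.pi * i / k), math.sin(math.pi * i / k))
                     for i in range(k)]
        self.lo = [None] * k  # (projection, point) with the smallest projection
        self.hi = [None] * k  # (projection, point) with the largest projection

    def add(self, p):
        """Process one point of the stream; only extreme points are retained."""
        for i, (dx, dy) in enumerate(self.dirs):
            s = p[0] * dx + p[1] * dy
            if self.lo[i] is None or s < self.lo[i][0]:
                self.lo[i] = (s, p)
            if self.hi[i] is None or s > self.hi[i][0]:
                self.hi[i] = (s, p)

    def estimate(self):
        """Largest distance between the two extreme points of any direction."""
        return max((math.dist(lo[1], hi[1])
                    for lo, hi in zip(self.lo, self.hi) if lo is not None),
                   default=0.0)


sketch = DirectionalDiameter(8)
for point in [(0, 0), (3, 4), (1, 1), (-1, 2)]:
    sketch.add(point)
estimate = sketch.estimate()  # the true diameter of these four points is 5
```

The guarantee follows because some stored direction lies within angle π/(2k) of the true diameter segment, so the two extreme points along that direction are at least cos(π/(2k))·D apart.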

Results on Geometric Problems (2). We give an ε-approximation algorithm that maintains the diameter in the sliding-window model. The algorithm uses O((1/ε)·log^3 n·log R) bits of space, where R is the largest diameter attained in any window. The amortized processing time for each point is O(log n). We show that Ω((1/ε)·log n·log R) space is required for such an approximation.

Graph Stream. Consider an undirected graph G = (V, E) with V = {v_1, v_2, …, v_n} and E = {e_1, e_2, …, e_m}. A graph stream is a sequence of the edges in E. Edges arrive in arbitrary order in the stream. This is more general than adjacency matrices or adjacency lists. Example stream: (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4).

Results on Graph Problems (1). Many problems require Ω(n) bits of space: graph distances (even approximation), connectivity testing, planarity testing, … Consider streaming algorithms that use O(n·polylog n) space and O(1) passes. In such a model, we can compute or approximate spanning trees and graph distances. [Feigenbaum-S. Kannan-McGregor-Suri-Zhang]
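
As an illustration of what O(n·polylog n) space buys, the sketch below builds a spanning forest in one pass with a union-find structure that stores O(n) words. This is the standard textbook technique, shown under the assumption that vertices are numbered 1..n; the function name `spanning_forest` is hypothetical and is not taken from the talk.

```python
def spanning_forest(n, edge_stream):
    """One pass over the edges; keeps an edge only if it joins two components."""
    parent = list(range(n + 1))  # vertices are assumed to be numbered 1..n

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge connects two different components
            parent[ru] = rv
            forest.append((u, v))
    return forest


# The example edge stream from the "Graph Stream" slide, on 5 vertices.
stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
forest = spanning_forest(5, stream)
```

The graph is connected exactly when the forest has n - 1 edges, so the same pass also answers connectivity testing.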

Results on Graph Problems (2). We give a randomized streaming algorithm that approximates graph distances. (1+ε, β)-approximation: our algorithm outputs estimates {δ(u,v)} s.t. δ(u,v) ≤ (1+ε)·dist_G(u,v) + β, where dist_G(u,v) is the true distance between vertices u and v. The algorithm uses O(n^{1+1/k}) space. Processing time per edge is O(n^{1/k}). It needs multiple passes. 1/k and ε are arbitrarily small parameters; β and the number of passes are functions of k and 1/ε. [Elkin-Zhang]

Results on Graph Problems (3). We give a one-pass streaming algorithm for approximating graph distances. (2t+1)-approximation: δ(u,v) ≤ (2t+1)·dist_G(u,v). Space: O(t·n^{1+1/t}·log n). Processing time per edge: O(t^2·n^{1/t}·log n). Needs one pass. For t = log n, this gives a one-pass, O(log n)-approximation algorithm using n·polylog n space and polylog n time per edge. Lower bound: the space complexity of one-pass t-approximation is Ω(n^{1+1/t}). [Feigenbaum-S. Kannan-McGregor-Suri-Zhang]

Publications.
J. Feigenbaum, S. Kannan, and J. Zhang, “Computing Diameter in the Streaming and Sliding-Window Models,” Algorithmica 41 (2005).
J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, “On Graph Problems in a Semi-Streaming Model,” ICALP 2004. Journal version to appear in Theoretical Computer Science.
M. Elkin and J. Zhang, “Efficient Algorithms for Constructing (1+ε,β)-Spanners in the Distributed and Streaming Models,” PODC 2004.
J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, “Graph Distances in the Streaming Model: The Value of Space,” SODA 2005.

Other Results in Thesis.
The streaming-space requirement can be reduced by annotating the stream. J. Feigenbaum, S. Kannan, and J. Zhang, “Annotation and Computational Geometry in the Streaming Model,” Yale University Technical Report YALEU/DCS/TR-1249, 2003.
Using streaming algorithms to detect BGP-update anomalies. J. Zhang, J. Rexford, and J. Feigenbaum, “Learning-Based Anomaly Detection in BGP Updates,” to appear in SIGCOMM Workshop on Mining Network Data 2005.

Talk Outline: Streaming computational model; Overview of results; Approximate graph distances in the streaming model; Future research directions.

Shortest-Path Distances. The distance between two vertices is the length of the shortest path between them. This is a fundamental problem in graph theory, with many algorithms and approximations. Most of them use BFS-like subroutines, which are hard to adapt to the streaming model.

The “Sketch” Approach. A two-stage approach. First stage: while going through the stream, construct a small sketch of the input graph. Second stage: compute the distance using the sketch, without further access to the stream. BFS-like computations are performed in the second stage.

Graph Spanners as Sketches. A spanner is an edge subgraph H of a graph G s.t., for any pair of vertices u and v, their distance in H, dist_H(u,v), is not far from their distance in G, dist_G(u,v). Multiplicative spanner [t-spanner]: dist_H(u,v) ≤ t·dist_G(u,v). Spanners are sparse: a t-spanner has O(n^{1+1/t}) edges. This reduces streaming graph distance to streaming spanner construction. However, BFS-like subroutines are used in most existing spanner constructions.
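
To make the t-spanner condition concrete, the sketch below measures the worst ratio dist_H(u,v) / dist_G(u,v) over all connected pairs by running BFS in both graphs. It is an offline check for small examples, not a streaming routine, and the helper names `adjacency`, `bfs_distances`, and `max_stretch` are hypothetical.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def max_stretch(g_edges, h_edges):
    """Largest dist_H(u,v) / dist_G(u,v) over pairs connected in G."""
    g, h = adjacency(g_edges), adjacency(h_edges)
    worst = 1.0
    for u in g:
        dist_g = bfs_distances(g, u)
        dist_h = bfs_distances(h, u)
        for v, d in dist_g.items():
            if v != u:
                # A pair disconnected in H has infinite stretch.
                worst = max(worst, dist_h.get(v, float("inf")) / d)
    return worst


cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]   # G: a 4-cycle
path = [(1, 2), (2, 3), (3, 4)]            # H: G without the edge (4, 1)
stretch = max_stretch(cycle, path)
```

H is a t-spanner of G exactly when `max_stretch` returns a value of at most t; in the example the dropped edge (4, 1) is replaced by a path of length 3.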

Streaming Spanner Construction. For each incoming edge, decide whether it should be in the spanner. If the edge causes a cycle of length ≤ t, do not put the edge in the spanner. This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge. This spanner is sparse. Thm [Bollobás78]: a graph whose girth is larger than k can have only O(n^{1+2/(k-1)}) edges. Need to know: for an incoming edge, does the path P exist?
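
The rule above can be sketched naively by answering the question "does the path P exist?" with a depth-limited BFS in the spanner built so far. This is only an illustration under that assumption: the per-edge BFS is exactly the cost that the cluster-based algorithm in the following slides is designed to avoid, and the function names are hypothetical.

```python
from collections import deque


def within_distance(adj, u, v, limit):
    """True if the current spanner has a u-v path with at most `limit` edges."""
    if u == v:
        return True
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        x, d = frontier.popleft()
        if d == limit:
            continue  # do not explore beyond the depth limit
        for y in adj.get(x, ()):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                frontier.append((y, d + 1))
    return False


def greedy_stream_spanner(edge_stream, t):
    """Keep an edge unless it would close a cycle of length at most t."""
    adj, kept = {}, []
    for u, v in edge_stream:
        # A path of at most t - 1 edges plus (u, v) is a cycle of length <= t.
        if within_distance(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        kept.append((u, v))
    return kept


# The example edge stream from the "Graph Stream" slide, with t = 3.
stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
spanner = greedy_stream_spanner(stream, 3)
```

With t = 3, the edges (1,2), (1,5), and (3,4) are discarded because each would close a triangle, while (2,4) is kept because its endpoints are three hops apart when it arrives.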

Partial Solution: Clusters (1). A cluster is a subset of vertices together with a small-diameter spanning tree built on these vertices. [Figure: an intra-cluster edge.]

Partial Solution: Clusters (2). [Figure: inter-cluster edges.] Bollobás’s result no longer applies. Need to control the number of clusters (i.e., make it ).

Summary of the One-Pass Algorithm. Use a vertex-labeling scheme to construct the clusters. Structure of the algorithm: (1) In the pre-processing phase, generate a multi-level set of labels. (2) Go through the stream; for each edge, decide whether to put it in the spanner according to the current assignment of labels to vertices, and, depending on the type of edge, possibly assign more labels to one of its endpoints. Next, an example with t = log n.

Labels. (log n)/2 levels. W.h.p., there are top-level labels. Semantics of labels: the set of vertices assigned the same top-level label forms a cluster; the set of vertices assigned the same lower-level label forms a “pre-cluster.” [Figure: example label hierarchy. Level 0: (0,1), (0,2), …, (0,12). Level 1: (1,2), (1,4), (1,7), (1,11). Level 2: (2,2), (2,7).]

Initial Label Assignment. [Figure: vertices v_1, …, v_12 are assigned the level-0 labels (0,1), …, (0,12), respectively. Level 1 labels: (1,2), (1,4), (1,7), (1,11). Level 2 labels: (2,2), (2,7).]

On Arrival of an Edge. We already know what to do with intra-cluster/pre-cluster edges and inter-cluster edges. Edges connecting pre-clusters are the sticky edges. They are added to the spanner. They may lead to new label assignments and cluster growth.

“Good” Neighbor (1). [Figure: an edge between v and u; labels shown: (3,2), (2,2), (1,2), (0,2), (1,6), (0,6). Annotation: has marked labels.]

“Good” Neighbor (2). [Figure: v and u with the clusters C(1,2), C(2,2), C(3,2), and C(1,6).]

“Bad” Neighbor. [Figure: an edge between v and u; labels shown: (3,2), (1,6). Annotation: no marked labels.]

Properties of the Clusters. Small diameter. Number of clusters bounded by. The clusters do not need to cover the whole graph, but the uncovered subgraph is sparse: it consists of sticky edges, and there are not too many of them.

Sticky Edges are Rare. [Figure: a vertex v whose neighbors u_1, u_2, u_3, u_4, … arrive in sequence.] A neighbor is good with probability at least 1/2. After seeing at most (log n)/2 good neighbors, v will be assigned a top-level label and be included in a cluster. After that, there are no more sticky edges for v. The number of sticky edges can be bounded by the length of the shortest prefix in the above sequence that contains (log n)/2 good neighbors.
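
The prefix argument can be written as a short expectation bound. The following is a sketch under the added assumption that each arriving neighbor is good independently of the earlier ones; the symbols L and T are introduced here for illustration and do not appear on the slide.

```latex
% Assumption: each neighbor of v is good independently with probability >= 1/2.
% L = (\log n)/2 : number of good neighbors after which v joins a cluster.
% T : length of the shortest prefix u_1, ..., u_T that contains L good neighbors.
\[
  \mathbb{E}[T] \;\le\; \frac{L}{1/2} \;=\; 2L \;=\; \log n ,
  \qquad\text{so}\qquad
  \mathbb{E}\bigl[\#\,\text{sticky edges at } v\bigr] \;\le\; \log n .
\]
% Summing over all n vertices gives O(n \log n) sticky edges in expectation.
```

The first inequality is the mean of a negative binomial waiting time: collecting L successes with success probability at least 1/2 takes at most 2L trials on average.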

Talk Outline: Streaming computational model; Overview of results; Approximate graph distances in the streaming model; Future research directions.

Summary. We investigated two important problem domains. Exact computation is hard; approximation may be feasible. For some problems, particularly graph problems, considering a more general model is important, because polylog space is too restrictive. Constructing a sketch of non-numerical input is an important tool in streaming-algorithm design.

Future Research Directions. Geometric problems: high-dimensional geometric problems; sliding windows with flexible size. Graph problems: dynamic graph problems.