Algorithms for Big Data: Streaming and Sublinear Time Algorithms


Algorithms for Big Data: Streaming and Sublinear Time Algorithms Moran Feldman

Motivation: Big Data (huge data sets)
Why? The Internet makes it easy to collect and transfer data, and new equipment (e.g., the LHC) produces enormous amounts of it.
So what? It is difficult to process all the data, and difficult even to store all of it.

What is this Talk About? Big data has motivated a lot of research (both CS and non-CS). In this talk we are interested in theoretical algorithms for big data problems: sublinear time algorithms and streaming algorithms. We will see a few classical algorithms of each kind.

Sublinear Time Algorithms

Sublinear Time Algorithms
Most algorithms read all their input, and therefore require at least linear time. We are interested in sublinear time algorithms, which cannot afford to read all of their input. We will start with a simple example…

Diameter Approximation
Instance: A set P of points, and a function d: P × P → ℝ giving the distance between every pair of points. We assume d is a metric:
- All distances are non-negative.
- d(u, v) = 0 ⇔ u = v.
- d(u, v) = d(v, u).
- d(u, v) + d(v, w) ≥ d(u, w) (the triangle inequality).
Objective: Approximate the diameter D of P.

Algorithm
Trivial Algorithm: Query the distance between every pair of points, and return the maximum one. Time complexity: O(|P|²), which is the size of the input.
More Involved Algorithm: Fix an arbitrary point u, query the distance of every other point from u, and return the maximum distance found.

Analysis
Time Complexity: O(|P|), which is a square root of the size of the input.
Guarantee: Let d(v, w) be the diameter D. By the triangle inequality, d(u, v) + d(u, w) ≥ d(v, w) = D, so max{d(u, v), d(u, w)} ≥ D/2. Hence the algorithm outputs a value D′ such that D/2 ≤ D′ ≤ D.
This is a rare example of a sublinear time deterministic algorithm.
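The "more involved" algorithm above fits in a few lines. Here is a minimal sketch (the point set and distance function in the example are illustrative, not from the talk):

```python
def approximate_diameter(points, d):
    """Return D' with D/2 <= D' <= D using only |P| - 1 distance queries."""
    u = points[0]                                  # fix an arbitrary point u
    return max(d(u, v) for v in points if v != u)  # farthest point from u

# Example: points on a line under the usual metric; the true diameter is 9.
points = [0, 1, 5, 9]
dist = lambda a, b: abs(a - b)
print(approximate_diameter(points, dist))  # 9 (u = 0 happens to be an endpoint)
```

If u is not an endpoint of the diameter, the answer can be as small as D/2, e.g. `approximate_diameter([5, 0, 9], dist)` returns 5 while D = 9.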

Property Testing
We are interested in deciding whether an object has some property. The answer often depends on all the input:
- Is a list of numbers sorted?
- Are all the numbers in a set distinct?
- Is an image a half-plane?
To get the right answer with a constant probability, one has to read a constant fraction of the input.

Property Testing (cont.)
The exact definition varies. Intuitively, an object is ε-far from a property if changing an ε fraction of the object cannot make it have the property.
Distinguish between two cases:
- The object has the property → answer "Yes".
- The object is ε-far from having the property → answer "No".
- Otherwise → the answer does not matter!

Testing List Sortedness
Instance: A list of n numbers x1, …, xn.
Objective: Test whether the list is sorted (ascending) or ε-far from being sorted, i.e., more than ε·n numbers have to be changed to make the list sorted.
Trivial Algorithms:
- Pick a uniformly random 1 ≤ i ≤ n − 1, and test whether xi ≤ xi+1. Fails with high probability on the ½-far instance 1111100000 (only one adjacent pair is out of order).
- Pick uniformly random 1 ≤ i < j ≤ n, and test whether xi ≤ xj. Fails with high probability on the ½-far instance 1032547698 (only the n/2 swapped pairs are out of order).

Algorithm [EKKRV00]
1. Pick a uniformly random index i.
2. Run a binary search for the value xi.
3. Answer "No" if the binary search ends up at a position other than i (and "Yes" otherwise).
(The slide illustrates the binary search tree over the positions: root x4, children x2 and x6, leaves x1, x3, x5, x7.)
[EKKRV00]: Funda Ergün, Sampath Kannan, S. Ravi Kumar, Ronitt Rubinfeld, Mahesh Viswanathan.

Completeness Analysis
We need to show that the algorithm always returns "Yes" when the input is sorted. The algorithm answers "No" only if the binary search for xi ends up at a point other than i, which should never happen when the list is sorted (and the elements are unique).

Soundness Analysis
We need to upper bound the probability that the algorithm returns "Yes" when the list is ε-far from being sorted. Call an index i "good" if the algorithm returns "Yes" when it randomly chooses i. Clearly:
Pr["Yes"] = (number of good indexes) / n.
Thus, we want to upper bound the number of good indexes in an ε-far list.

Main Observation
Lemma: The elements at the good indexes form a sorted sub-list.
Proof: Let i < j be two good indexes, and let k be the index of their lowest common ancestor in the binary search tree. Since i and j are good, the binary search for xi turns left at position k and the binary search for xj turns right there, so xi < xk and xk < xj, hence xi < xj.

Soundness Analysis (cont.)
In an ε-far list, no (1 − ε)n elements can form a sorted sub-list, so there are fewer than (1 − ε)n good indexes. Hence the algorithm answers "Yes" with probability less than 1 − ε.
This can be improved by repetition. Repeating the algorithm ε⁻¹ times yields an algorithm that:
- Always answers "Yes" for a sorted input.
- Answers "No" for an ε-far input with probability at least 1 − (1 − ε)^(1/ε) ≥ 1 − 1/e ≈ 0.632.
- Has time complexity O(ε⁻¹ log n).

Streaming Algorithms

Motivation
Two Scenarios:
(a) Magnetic tape.
- Advantages: cost effective (hardware, energy); fast sequential access; can be stored offsite (backup, security).
- Disadvantages: poor random access speed; poor long term reliability (data has to be copied occasionally).
(b) A network element that processes the network traffic passing through it, for example to detect malicious activity.
- Problem: the element can store only a small fraction of the traffic; the rest it can only monitor.

Streaming Model
The algorithm reads an input stream (e.g., the edges of an input graph, or the words of an input document) and produces an answer based on the input.
Main Issue: the algorithm should use little memory, often polylogarithmic in the size of the input.
Multiple Passes: sometimes the algorithm is allowed multiple passes over the input. This is appropriate for the magnetic tape motivation.

Finding Frequent Elements
Theorem [MG82] (Misra and Gries): There is a streaming algorithm using O(k(log n + log m)) space which:
- Outputs a set of at most k − 1 elements.
- The set contains every element with more than n/k appearances in the stream.
Remarks:
- A second pass can be used to detect the elements that really have more than n/k appearances.
- For simplicity, we present the algorithm for the case k = 2 (a case that occasionally comes up in job interviews).

Algorithm (k = 2)
1. Initialize: counter ← 0.
2. For each arriving element e:
   - If counter = 0: set counter ← 1 and candidate ← e.
   - Else, if candidate = e: set counter ← counter + 1.
   - Else: set counter ← counter − 1.
3. Return candidate.
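The k = 2 case above (often called the Boyer-Moore majority vote) translates directly into code; a minimal sketch:

```python
def majority_candidate(stream):
    """Misra-Gries for k = 2: one pass over the stream, and only
    O(log n + log m) bits of state (one counter plus one candidate)."""
    counter, candidate = 0, None
    for e in stream:
        if counter == 0:
            counter, candidate = 1, e    # adopt a new candidate
        elif candidate == e:
            counter += 1                 # the candidate gains a vote
        else:
            counter -= 1                 # a different element cancels a vote
    return candidate

# If some element appears more than n/2 times, it is the returned candidate;
# a second pass would be needed to verify its actual count.
print(majority_candidate([3, 1, 3, 3, 2, 3, 3]))  # 3
```

Note that when no majority element exists, the returned candidate is arbitrary, which is exactly why the theorem allows a verification pass.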

Analysis
Immediate Observations:
- The algorithm uses O(log n + log m) space.
- The algorithm outputs a single element.
If there is no element that appears more than n/2 times, then we are done. Otherwise, let e1/2 be this element.
Definition: X is defined as follows:
- X = counter when the candidate is e1/2.
- X = −counter when the candidate is not e1/2.
Lemma: At every given time during the execution of the algorithm:
X ≥ (appearances of e1/2 so far) − (appearances of other elements so far).

Proof of the Lemma
The proof is by induction. The inequality trivially holds before the first element arrives. Assume it holds before the arrival of an element e, and consider the four cases:
- e = e1/2 and the candidate is e1/2: the counter increases, so both sides increase by 1.
- e = e1/2 and the candidate is not e1/2: the counter decreases (or e1/2 becomes the candidate), so X and the right side both increase by 1.
- e ≠ e1/2 and the candidate is e1/2: the counter decreases, so both sides decrease by 1.
- e ≠ e1/2 and the candidate is not e1/2: the right side decreases by 1, while the left side changes by 1 in either direction, so the inequality is preserved.
Intuitively, we get an inequality (rather than an equality) because elements other than e1/2 might cancel each other out.

Wrapping Up the Proof
After all the input is processed, we have:
X ≥ (appearances of e1/2) − (appearances of other elements) > 0,
where the right side is positive because e1/2 appears more than n/2 times. Since X is positive, the candidate must be e1/2 (otherwise X = −counter ≤ 0), so e1/2 is the final candidate.

Streaming Algorithms for Graph Problems
The stream consists of the edges of the graph. We allow O(n · polylog(n)) space, which is sublinear in the length of the stream (the stream can contain Θ(n²) edges).
(Trivial) Algorithm for Counting Connected Components:
- Initially, each node is an independent connected component.
- For each edge e that arrives, if the endpoints of e belong to different connected components, merge these connected components.

Algorithm for Counting Connected Components: Analysis
Space complexity: O(n log n), since it is enough to maintain the list of nodes in each connected component.
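The merging step is naturally implemented with a union-find structure; a minimal sketch (the class name and the path-halving heuristic are my choices, not from the talk):

```python
class ComponentCounter:
    """Streaming connected-components counter over an edge stream.

    Stores one parent pointer per node: O(n log n) bits, sublinear in
    the stream length, which can be as large as Theta(n^2) edges."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.count = n                # initially every node is its own component

    def _find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:                  # endpoints in different components: merge
            self.parent[ru] = rv
            self.count -= 1

# Stream the edges of a 5-node graph: {0, 1, 2} and {3, 4} stay separate.
cc = ComponentCounter(5)
for u, v in [(0, 1), (1, 2), (3, 4), (0, 2)]:
    cc.add_edge(u, v)
print(cc.count)  # 2
```

The graph is connected exactly when `count` ends at 1.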

Applications
Immediate Application: Determining whether a graph is connected.
More Interesting Application: Determining whether a graph is bipartite.
Algorithm: From the stream of G, construct the stream of a graph G² that has two copies u1, u2 of every node u of G, and for every edge (u, v) of G the two edges (u1, v2) and (u2, v1). Let n1 and n2 be the number of connected components in G and G², respectively. Then G is bipartite if and only if 2n1 = n2.

Analysis
Lemma: The copies of the nodes of a connected component C of G form:
- Two connected components of G² if C is bipartite.
- A single connected component of G² if C is not bipartite.
This lemma implies that the algorithm is correct.
Proof: There is never a path in G² between copies of nodes that are not connected in G. Conversely, if u and v are connected in G, then each copy of u in G² is connected to some copy of v. Hence, the copies of the nodes of a connected component C of G form one or two connected components in G²; moreover, in the latter case, each component contains exactly one copy of each node of C.

Analysis (cont.)
C is not bipartite: Let v be a node on an odd cycle of C. The cycle becomes a path between v1 and v2 in G², so the two copies of C merge into a single connected component.
C is bipartite: A path between v1 and v2 in G² would imply an odd cycle in G, which cannot exist since C is bipartite. (Alternative view: if C has sides A and B, then the copies of C split into the two components A1 ∪ B2 and A2 ∪ B1.)
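The whole bipartiteness test can be sketched end to end. This is a minimal offline sketch: it materializes the edge list for clarity, whereas a true streaming implementation would feed both component counters in a single pass; the helper `count_components` is mine, not from the talk.

```python
def count_components(n, edges):
    """Connected components of an n-node graph via union-find."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    count = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            count -= 1
    return count

def is_bipartite(n, edge_stream):
    """G is bipartite iff its double cover G2 (node u -> copies u and u + n,
    edge (u, v) -> edges (u, v + n) and (u + n, v)) has exactly twice as
    many connected components as G."""
    edges = list(edge_stream)               # one pass suffices in a real stream
    n1 = count_components(n, edges)
    n2 = count_components(2 * n,
                          [(u, v + n) for u, v in edges] +
                          [(u + n, v) for u, v in edges])
    return n2 == 2 * n1
```

For example, the path 0-1-2 is bipartite (its double cover has two components), while the triangle is not (its double cover collapses into one component).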

Questions?