Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams.

1 Algorithms for geometric data streams
Christian Sohler, TU Dortmund, Komplexitätstheorie und effiziente Algorithmen (Complexity Theory and Efficient Algorithms)

2 Data streams (Introduction)
Massive data set arriving sequentially
Different ways of "arriving"
Examples: network traffic, query logs, …
Approach: find algorithms that make a single pass (or a few passes) and process the data sequentially

3 Geometric data streams (Introduction)
Massive sets of geometric objects arriving sequentially
Objects are typically points
Different forms of arrival:
- sequence of points
- sequence of updates
Question: find ways to analyze the geometric structure of the input data using small space

4 Motivation (Introduction)
Many computational tasks can be interpreted geometrically
Geometric features may be useful in learning and classification
Geometry plays an important role in the application
Examples: learning; clustering (how "clusterable" is a data set?); road traffic prediction

5-11 A basic learning problem (Introduction)
We have two classes of objects
We are given examples from both classes
Learn from the examples to which class future objects belong
Map each object's description to Euclidean space
SVM approach: compute the maximum-margin hyperplane and classify points according to their side of it

12 SVM and SEB (smallest enclosing balls) (Introduction)
The dual of a certain SVM formulation is SEB [Tax, Duin, Pattern Recognition Letters, '99]
Geometric streaming SEB can be used as an SVM heuristic [Rai, Daume III, Venkatasubramanian, IJCAI'09]
Also: coresets have been used to construct CSVMs [Tsang, Kwok, Cheung, Journal of Machine Learning Research, '05]

13 Outline (Introduction)
Merge & Reduce
Embeddings into tree metrics
Estimation of distribution of local neighborhoods
Balanced partitions
Approximating properties of balanced partitions

14 Insertion-only streams (Merge & Reduce)
Sequence of points $p_1, \dots, p_n$ from $\mathbb{R}^d$

15 Definition [k-median clustering] (Merge & Reduce)
Given a weighted set $P$ of points in $\mathbb{R}^d$, the k-median problem is to find a set $C \subseteq \mathbb{R}^d$ of $k$ points (centers) such that
$\mathrm{cost}(P,C) = \sum_{p \in P} w_p \cdot \min_{c \in C} \|p-c\|$
is minimized, where $w_p > 0$ is the weight of point $p$.
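The cost function in this definition can be evaluated directly. The following is a minimal sketch, assuming points and centers are given as coordinate tuples; the name `kmedian_cost` and the sample data are illustrative and not from the talk.

```python
import math

def kmedian_cost(points, weights, centers):
    """Weighted k-median cost: sum over p of w_p times the distance to the nearest center."""
    return sum(
        w * min(math.dist(p, c) for c in centers)
        for p, w in zip(points, weights)
    )

# Illustrative data (an assumption, not from the slides).
points = [(0.0, 0.0), (2.0, 0.0), (5.0, 4.0)]
weights = [1.0, 3.0, 2.0]
centers = [(0.0, 0.0), (5.0, 0.0)]

# Nearest-center distances are 0, 2 and 4, so the cost is 1*0 + 3*2 + 2*4 = 14.
example_cost = kmedian_cost(points, weights, centers)
```

This only evaluates a given set of centers; finding the minimizing set C is the hard part of the problem.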

16 Coreset [Har-Peled, Mazumdar, STOC'04] (Merge & Reduce)
A weighted point set $S$ is a $(k,\varepsilon)$-coreset of a weighted point set $P$ if, for every set $C$ of $k$ centers,
$|\mathrm{cost}(P,C) - \mathrm{cost}(S,C)| \le \varepsilon \cdot \mathrm{cost}(P,C)$.
[Figure: a point set and its coreset, with coreset points carrying weights 3 and 4]
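The coreset condition quantifies over every set of k centers, which cannot be checked exhaustively. The sketch below only measures the relative error on a finite list of candidate center sets, so it can refute the coreset property but never prove it. The helper names and the data are assumptions made for illustration.

```python
import math

def cost(weighted_points, centers):
    """k-median cost of a list of (point, weight) pairs."""
    return sum(w * min(math.dist(p, c) for c in centers) for p, w in weighted_points)

def max_relative_error(P, S, candidate_center_sets):
    """Largest |cost(P,C) - cost(S,C)| / cost(P,C) over the given center sets."""
    worst = 0.0
    for C in candidate_center_sets:
        cost_p = cost(P, C)
        if cost_p > 0:
            worst = max(worst, abs(cost_p - cost(S, C)) / cost_p)
    return worst

# Four unit-weight points, summarized by two points of weight 2 (illustrative data).
P = [((0.0, 0.0), 1), ((0.1, 0.0), 1), ((10.0, 0.0), 1), ((10.1, 0.0), 1)]
S = [((0.05, 0.0), 2), ((10.05, 0.0), 2)]

far_centers = [[(5.0, 0.0)]]                      # S matches P almost exactly here
near_centers = [[(0.05, 0.0), (10.05, 0.0)]]      # centers on top of S: cost(S,C) = 0

error_far = max_relative_error(P, S, far_centers)
error_near = max_relative_error(P, S, near_centers)
```

In this example the summary looks perfect for a far-away center but has relative error 1 when the centers coincide with the summary points, which is why the definition insists on every C.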

17-27 Observation (Merge & Reduce)
The union of two $(k,\varepsilon)$-coresets is a $(k,\varepsilon)$-coreset (of the union of the two underlying point sets)
Can compute a coreset of a coreset
[Figure, built up over slides 17-27: Input Stream, Coreset, Coreset of Union of Coresets]
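The two observations combine into the usual merge-and-reduce scheme: summarize fixed-size blocks of the stream, and whenever two summaries of the same level exist, take their union and summarize it again, like carrying in a binary counter. The sketch below is an illustration under assumptions: the class name, `block_size`, and the pluggable `reduce_fn` are hypothetical, and the placeholder `reduce_exact` only merges identical points (an exact summary with no size guarantee), standing in for a real coreset construction.

```python
from collections import Counter

def reduce_exact(weighted_points):
    """Placeholder reduce step: merge identical points by adding their weights.
    This preserves every cost exactly but does not bound the summary size."""
    merged = Counter()
    for p, w in weighted_points:
        merged[p] += w
    return list(merged.items())

class MergeReduce:
    def __init__(self, block_size, reduce_fn):
        self.block_size = block_size
        self.reduce_fn = reduce_fn
        self.buffer = []   # points of the current, incomplete block
        self.levels = []   # levels[i] summarizes 2**i blocks, or is None

    def insert(self, point):
        self.buffer.append((point, 1.0))
        if len(self.buffer) == self.block_size:
            self._carry(self.reduce_fn(self.buffer))
            self.buffer = []

    def _carry(self, summary):
        # Like binary addition: merge with an occupied level and move one level up.
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            summary = self.reduce_fn(self.levels[level] + summary)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = summary

    def summary(self):
        """Summary of everything seen so far: union of all levels and the buffer."""
        parts = list(self.buffer)
        for s in self.levels:
            if s is not None:
                parts.extend(s)
        return self.reduce_fn(parts)

mr = MergeReduce(block_size=2, reduce_fn=reduce_exact)
for i in range(10):
    mr.insert((i % 3, 0))
result = dict(mr.summary())
```

At most one summary per level is stored, so a stream of n points keeps O(log n) summaries. With an approximate reduce step the error compounds at each level, so in the usual analysis the per-level error parameter is chosen smaller than the target ε.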

28 Coresets by pre-clustering (Merge & Reduce)
[Guha, Mishra, Motwani, O'Callaghan, FOCS'00; Har-Peled, Mazumdar, STOC'04; Frahling, S., STOC'05]
Compute a pre-clustering $S$ with more than $k$ centers and $\mathrm{cost}(P,S) \le \varepsilon \cdot \mathrm{Opt}_k$
Size exponential in $d$
[Figure: pre-clustered point set with weights 3 and 4]

29 Coresets by sampling (Merge & Reduce)
[Chen, SICOMP'09; Feldman, Monemizadeh, S., SoCG'07]
Compute a random non-uniform sample
Show that the sample approximates all solutions from a net
Size polynomial in $d$
[Figure labels: M, M, M/4]

30 Coresets by reduction to 1D (Merge & Reduce)
[Har-Peled, Kushal, DCG'07; Feldman, Fiat, Sharir, FOCS'06]
Use geometric arguments to solve the 1D case
Combine with pre-clustering using line centers
For k-median: size independent of $n$ (but exponential in $d$)

31 Open problems (Merge & Reduce)
Coresets for k-median of size independent of $n$ and $d$? (Partial result in [Feldman, Monemizadeh, S., SoCG'07])
Coresets for k-median of size $O(d/\varepsilon^2)$
Coresets for k-median of size $\mathrm{poly}(d, \log n)/\varepsilon^{2-c}$ for a constant $c = c(d) > 0$
Coresets for j-subspace 1-median of size $\mathrm{poly}(\varepsilon, d, j, \log n)$?
Same questions for the k-means objective function
Remark: the open questions refer to the definition of coresets from this talk.

32 Insertion/deletion model (Geometric update streams)
The stream consists of Insert(p) and Delete(p) operations
Points are from $\{1, \dots, \Delta\}^d$
The stream is consistent, i.e. there is no Delete(p) if p is not present, and no Insert(p) if p is already present in the current set
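The consistency requirement can be made concrete with a small reference checker. This is only a sketch for illustration: it stores the whole current point set, which a small-space streaming algorithm is not allowed to do, and the function name and the tuple encoding of operations are assumptions.

```python
def apply_updates(updates):
    """Replay (operation, point) pairs and return the final point set.
    Raises ValueError if the stream is not consistent."""
    current = set()
    for op, p in updates:
        if op == "insert":
            if p in current:
                raise ValueError(f"Insert({p}) but the point is already present")
            current.add(p)
        elif op == "delete":
            if p not in current:
                raise ValueError(f"Delete({p}) but the point is not present")
            current.remove(p)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return current

# A consistent stream: the point (1, 2) is inserted and later deleted.
stream = [("insert", (1, 2)), ("insert", (3, 4)), ("delete", (1, 2))]
final_points = apply_updates(stream)
```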

33-38 Embeddings in tree metrics (Streaming algorithms via embeddings into tree metrics)
[Figure, built up over slides 33-38: the points p, q, r, s, t are covered by nested grids; the grid cells form a tree over the points, with edges of length $2^i$, $2^{i-1}$, $2^{i-2}$, … at successive levels]

39 Embeddings in tree metrics (Streaming algorithms via embeddings into tree metrics)
Tree metric $D(\cdot,\cdot)$ with
$\|p-q\| \le D(p,q)$
$\mathbb{E}[D(p,q)] = O(\log \Delta) \cdot \|p-q\|$
[Bartal, FOCS'96; Charikar, Chekuri, Goel, Guha, Plotkin, FOCS'98]
[Figure: the nested grids and the tree from the preceding slides]
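One common way to obtain such a tree metric is a randomly shifted hierarchy of nested grids (a quadtree). The sketch below is an illustration under assumptions, not necessarily the exact construction from the cited papers: points have integer coordinates in {0, …, Δ-1}^d, the edge from a level-j cell to its parent has length √d · 2^j (the √d factor is added here so that the tree distance never falls below the Euclidean distance), and the function names are hypothetical.

```python
import math
import random

def make_shift(delta, d, rng):
    """Random shift of the grid hierarchy, one offset per coordinate."""
    return tuple(rng.randrange(delta) for _ in range(d))

def tree_distance(p, q, shift, delta):
    """Distance between p and q in the tree induced by the shifted nested grids.
    Level j uses cells of side 2**j; the points meet at the lowest common cell."""
    d = len(p)
    top = (2 * delta - 1).bit_length()   # one cell of side 2**top holds all shifted points
    for level in range(top + 1):
        cell_p = tuple((x + s) >> level for x, s in zip(p, shift))
        cell_q = tuple((x + s) >> level for x, s in zip(q, shift))
        if cell_p == cell_q:
            # Both paths climb edges of length sqrt(d) * 2**j for j = 0..level-1.
            return 2 * math.sqrt(d) * (2 ** level - 1)
    raise AssertionError("unreachable: the top-level cell contains every point")

rng = random.Random(0)
delta = 64
shift = make_shift(delta, 2, rng)
pairs = [
    (tuple(rng.randrange(delta) for _ in range(2)),
     tuple(rng.randrange(delta) for _ in range(2)))
    for _ in range(200)
]
# The tree distance dominates the Euclidean distance for every pair.
dominates = all(math.dist(p, q) <= tree_distance(p, q, shift, delta) + 1e-9 for p, q in pairs)
```

Two points that share a level-j cell differ by at most 2^j - 1 in each coordinate, which is what gives the lower bound. The bound on the expected stretch comes from the random shift and is not verified by this sketch.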

40 Estimator for cost of Euclidean minimum spanning tree (EMST) [Indyk, STOC'04] (Streaming algorithms via embeddings into tree metrics)
Write EMST for the cost of the EMST
Write $\mathrm{MST}_D$ for the cost of a minimum spanning tree of the tree metric $D$
$\mathbb{E}[\mathrm{MST}_D] = O(\log \Delta) \cdot \mathrm{EMST}$ (linearity of expectation)
Use the cost of the MST of $D$ as the estimator

41 Observation [Indyk, STOC'04] (Streaming algorithms via embeddings into tree metrics)
The MST of $D(\cdot,\cdot)$ is given by the tree defining the tree metric
#edges of length $2^i$ = #non-empty cells in the corresponding grid
[Figure: the points p, q, r, s, t, their grid cells, and the tree with edges of length $2^i$]

42 Euclidean minimum spanning tree (Streaming algorithms via embeddings into tree metrics)
1. Use $O(\log \Delta)$ nested grids G(i) with side length $2^i$
2. for each grid
3.   approximate |G(i)| := #nonempty cells in G(i) using an $F_0$ sketch
4. return $\sum_i 2^i \cdot |G(i)|$
Theorem [Indyk, STOC'04]: The above algorithm computes an $O(\log \Delta)$-approximation to the cost of the minimum spanning tree.
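The four steps can be written out offline as follows. This sketch makes a significant simplification: it counts the nonempty cells exactly with a Python set, whereas the streaming algorithm replaces each count with an $F_0$ (distinct elements) sketch so that little space is needed. The function name, the 0-based integer coordinates, and the explicit `shift` argument are assumptions.

```python
def grid_mst_estimate(points, delta, shift):
    """Sum over grid levels i of 2**i times the number of nonempty cells of G(i)."""
    top = (2 * delta - 1).bit_length()   # at level `top` all points share one cell
    total = 0
    for level in range(top):
        cells = {
            tuple((x + s) >> level for x, s in zip(p, shift))
            for p in points
        }
        total += (2 ** level) * len(cells)
    return total

# Three points in {0,...,3}^2 with no shift (illustrative data):
# level 0 has 3 nonempty cells, level 1 has 2, level 2 has 1,
# so the estimate is 1*3 + 2*2 + 4*1 = 11.
estimate = grid_mst_estimate([(0, 0), (1, 0), (3, 3)], delta=4, shift=(0, 0))
```

The estimate is a sum of per-level cell counts, so the only streaming ingredient needed is a distinct-count sketch per level.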

43 Results using a similar approach [Indyk, STOC'04] (Streaming algorithms via embeddings into tree metrics)
Problem                  Approx. factor
Earth mover's distance   $O(\log \Delta)$
Facility location        $O(\log^2 \Delta)$
Matching                 $O(\log \Delta)$
k-median                 $O(1)$; $1+\varepsilon$ with huge extraction time

44 Distribution of neighborhoods (Streaming algorithms via estimating the distribution of local neighborhoods)
Grids G(i) as before
R-neighborhood of a cell C: the cells within distance at most R from C
$m_{C,R}(i)$ is the number of points in the i-th cell of the R-neighborhood of C
[Figure: a cell and its 2-neighborhood, with the cells numbered 1 to 21]

45 EMST estimator (Streaming algorithms via estimating the distribution of local neighborhoods)
Define $Z_{C,R}(i) = (m_{C,R}(i) > 0)$
EMST can be approximated from the $Z_{C,R}(i)$
The approximation ratio goes to 1 as R goes to $\infty$

46 EMST estimator (Streaming algorithms via estimating the distribution of local neighborhoods)
K: size of the R-neighborhood
The $Z_{C,R}$ are functions from $\{1, \dots, K\}$ to $\{0,1\}$
A random (nonempty) cell C defines a distribution over neighborhoods, i.e. over functions $Z: \{1, \dots, K\} \to \{0,1\}$
Can still estimate EMST from this distribution
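The distribution over occupancy patterns can be computed exactly for a small offline point set, which makes the object being estimated concrete. The sketch below rests on assumptions: it works in 2D, it takes the R-neighborhood to be the square of cells whose index offsets are at most R in each coordinate (the talk's figure shows a differently shaped 21-cell neighborhood), and it enumerates all nonempty cells rather than sampling them. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

def neighborhood_distribution(points, level, radius):
    """Map each occupancy pattern Z (a 0/1 tuple over the neighborhood cells)
    to the fraction of nonempty cells of side 2**level that exhibit it."""
    side = 2 ** level
    counts = Counter((x // side, y // side) for x, y in points)
    offsets = list(product(range(-radius, radius + 1), repeat=2))
    patterns = Counter()
    for cx, cy in counts:
        z = tuple(int((cx + dx, cy + dy) in counts) for dx, dy in offsets)
        patterns[z] += 1
    total = len(counts)
    return {z: c / total for z, c in patterns.items()}

# Two adjacent nonempty cells: each sees itself and one neighbor, on opposite sides.
dist = neighborhood_distribution([(0, 0), (1, 0)], level=0, radius=1)
```

A streaming algorithm only samples nonempty cells and so obtains an estimate of this distribution rather than the exact values.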

47 Algorithm (Streaming algorithms via estimating the distribution of local neighborhoods)
Sample a certain number of nonempty grid cells and maintain the number of points for each cell in their neighborhood
The sample gives an estimate of the distribution of the $Z_{C,R}(\cdot)$
Obtain an estimate of EMST from the estimated distribution
Theorem [Frahling, Indyk, S., IJCGA'07]: Let $\varepsilon > 0$ and $d$ be constants. The cost of a Euclidean minimum spanning tree of a point set in $\mathbb{R}^d$ given as an update stream can be estimated within a factor of $1 \pm \varepsilon$ using $\mathrm{polylog}(\Delta)$ space.

48 Open problems (Streaming algorithms via estimating the distribution of local neighborhoods)
$(1+\varepsilon)$-approximation for matching and/or earth mover's distance
Other problems? The approach is not very well understood
General characterization of problems solvable via approximation of the distribution of local neighborhoods

49 Estimating the distribution [Frahling, S., STOC'05] (Streaming algorithms via balanced partitions)
Divide space into regions
For each region maintain the number of points inside
Balance the "error" among regions
The notion of error depends on the problem
Example: 1-median in 1D, with error: cell width × #points in cell
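The 1D example can be checked numerically. If every point is replaced by a representative inside its own cell, each point moves by at most the cell width, so by the triangle inequality the 1-median cost for any center changes by at most the sum over cells of width × #points. The sketch below assumes uniform cells and uses the cell midpoint as representative; both choices, and all names, are illustrative.

```python
from collections import Counter

def one_median_cost(weighted_points, center):
    """1-median cost in 1D for a list of (position, weight) pairs."""
    return sum(w * abs(x - center) for x, w in weighted_points)

def snap_to_cells(points, width):
    """Replace the points of each cell by the cell midpoint, weighted by the cell count."""
    counts = Counter(int(x // width) for x in points)
    return [((cell + 0.5) * width, n) for cell, n in counts.items()]

def error_budget(points, width):
    """Sum over cells of (cell width) * (#points in cell)."""
    counts = Counter(int(x // width) for x in points)
    return sum(width * n for n in counts.values())

points = [0.5, 1.2, 3.7, 3.9, 8.0]   # illustrative data
width = 2.0
original = [(x, 1) for x in points]
summary = snap_to_cells(points, width)
budget = error_budget(points, width)

# Cost differences for a few candidate centers; each stays within the budget.
gaps = [
    abs(one_median_cost(original, c) - one_median_cost(summary, c))
    for c in (0.0, 3.7, 5.0, 10.0)
]
```

Balancing the partition then means choosing cells so that no single cell contributes too much of this budget.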

50 Small space? (Streaming algorithms via balanced partitions)
Problem dependent
Need to show that a decomposition into few regions with sufficiently small error exists

51 One approach [Frahling, S., STOC'05] (Streaming algorithms via balanced partitions)
Nested grids G(i)
For each grid, maintain the cells intersected by a random sample (sample sizes differ for different grids)
#sample points inside a cell -> estimate of #points inside the cell
Combine cells from different grids into a space decomposition

52 Works for (Streaming algorithms via balanced partitions)
k-median
k-means
MaxTSP, MaxMatching, maximum spanning tree, average distance, MaxCut
Why?
k-median and k-means require a proof
The last 5 problems can be reduced to 1-median

53 Approximating properties of balanced partitions [Lammersen, S., ESA'08] (Streaming algorithms via approximation of balanced partitions)
The previous approach may lead to many regions
Example: facility location
Can approximate properties of balanced partitions, e.g. #regions
Only gives an approximation of the cost of the solution
More details in Christiane's talk

54 Open problems (Streaming algorithms via balanced partitions)
Min-sum-k-clustering
Other problems?

55 Summary
(Some) techniques in geometric streaming:
Merge & Reduce
Embeddings into tree metrics
Estimation of distribution of local neighborhoods
Balanced partitions
Approximating properties of balanced partitions
And lots of open problems to work on…

56 Thank you!

