Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams.

Slide 2 (Introduction): Data streams
A data stream is a massive data set arriving sequentially; there are different ways of "arriving".
Examples: network traffic, query logs, …
Approach: find algorithms that make a single pass (or a few passes) and process the data sequentially.

Slide 3 (Introduction): Geometric data streams
Massive sets of geometric objects arrive sequentially; the objects are typically points.
Different forms of arrival: a sequence of points, or a sequence of updates.
Question: find ways to analyze the geometric structure of the input data using small space.

Slide 4 (Introduction): Motivation
Many computational tasks can be interpreted geometrically.
Geometric features may be useful in learning and classification.
Geometry plays an important role in the application.
Examples: learning; clustering; how "clusterable" is a data set?; road traffic prediction.

Slides 5-11 (Introduction): A basic learning problem
We have two classes of objects.
We are given examples from both classes.
Learn from the examples to which class future objects belong.
Map each object's description to Euclidean space.
SVM approach: compute the maximum-margin hyperplane and classify points according to the side of the hyperplane they lie on.
[Figure: example points from both classes and a query point marked "?".]

Slide 12 (Introduction): SVM and SEB (smallest enclosing balls)
The dual of a certain SVM formulation is SEB [Tax, Duin, Pattern Recognition Letters, '99].
Geometric streaming SEB can be used as an SVM heuristic [Rai, Daume III, Venkatasubramanian, IJCAI'09].
Also: coresets have been used to construct CSVMs [Tsang, Kwok, Cheung, Journal of Machine Learning Research, '05].

Slide 13 (Introduction): Outline
Merge & Reduce
Embeddings into tree metrics
Estimation of the distribution of local neighborhoods
Balanced partitions
Approximating properties of balanced partitions

Slide 14 (Merge & Reduce): Insertion-only streams
The input is a sequence of points $p_1, \dots, p_n$ from $\mathbb{R}^d$.

Slide 15 (Merge & Reduce): Definition [k-median clustering]
Given a weighted set $P$ of points in $\mathbb{R}^d$, the k-median problem is to find a set $C \subseteq \mathbb{R}^d$ of $k$ points (centers) such that
$\mathrm{cost}(P,C) = \sum_{p \in P} w_p \cdot \min_{c \in C} \|p - c\|$
is minimized, where $w_p > 0$ is the weight of point $p$.
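The cost function in this definition can be evaluated directly. The following is a minimal Python sketch; the function name `kmedian_cost` and the toy data are illustrative assumptions and do not come from the talk.

```python
import math

def kmedian_cost(points, weights, centers):
    """Weighted k-median cost: each point pays its weight times the
    Euclidean distance to its nearest center."""
    return sum(
        w * min(math.dist(p, c) for c in centers)
        for p, w in zip(points, weights)
    )

# Toy instance (hypothetical data): three weighted points, two centers.
P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
w = [1.0, 1.0, 2.0]
C = [(1.0, 0.0), (10.0, 0.0)]
print(kmedian_cost(P, w, C))  # 1*1 + 1*1 + 2*0 = 2.0
```

Evaluating the cost for a fixed C is easy; the hard part of k-median is the minimization over all sets C of k centers, which this sketch does not attempt.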

Slide 16 (Merge & Reduce): Coreset [Har-Peled, Mazumdar, STOC'04]
A weighted point set $S$ is a $(k,\varepsilon)$-coreset of a weighted point set $P$ if, for every set $C$ of $k$ centers,
$|\mathrm{cost}(P,C) - \mathrm{cost}(S,C)| \le \varepsilon \cdot \mathrm{cost}(P,C)$.
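To make the coreset inequality concrete, here is a small Python sketch that checks it for one given center set. The helper names are assumptions made for illustration. Note that the definition quantifies over every set C of k centers, so passing this check for a single C does not prove that S is a coreset.

```python
import math

def cost(points, weights, centers):
    """Weighted k-median cost of a point set with respect to given centers."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_coreset_bound(P, wP, S, wS, centers, eps):
    """Check |cost(P,C) - cost(S,C)| <= eps * cost(P,C) for ONE center set C."""
    full = cost(P, wP, centers)
    return abs(full - cost(S, wS, centers)) <= eps * full

# Hypothetical example: merging two coincident points into one point of
# weight 2 preserves the cost exactly, for every choice of centers.
P, wP = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0)], [1.0, 1.0, 1.0]
S, wS = [(0.0, 0.0), (4.0, 0.0)], [2.0, 1.0]
print(satisfies_coreset_bound(P, wP, S, wS, [(1.0, 0.0)], eps=0.1))
```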

Slides 17-27 (Merge & Reduce): Observation
The union of two $(k,\varepsilon)$-coresets is a $(k,\varepsilon)$-coreset (of the union of the two underlying point sets).
One can compute a coreset of a coreset.
[Figure: the input stream is cut into blocks; a coreset is computed for each block, and two coresets are replaced by a coreset of their union.]
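The two observations suggest a binary-counter scheme over the stream. The Python sketch below shows only the bookkeeping; the class name `MergeReduce`, the block size, and in particular `reduce_stub` are assumptions. The stub is a uniform reweighted sample that stands in for a real coreset construction and carries no approximation guarantee.

```python
import random

def reduce_stub(weighted_points, size, rng):
    """Placeholder for a coreset construction (NOT a real coreset):
    uniform sample, reweighted so that the total weight is preserved."""
    if len(weighted_points) <= size:
        return list(weighted_points)
    total = sum(w for _, w in weighted_points)
    sample = rng.sample(weighted_points, size)
    return [(p, total / size) for p, _ in sample]

class MergeReduce:
    def __init__(self, block_size, reduce=reduce_stub, seed=0):
        self.block_size = block_size
        self.reduce = reduce
        self.rng = random.Random(seed)
        self.buffer = []   # raw points of the current, incomplete block
        self.levels = []   # levels[i]: None or one summary of 2^i blocks

    def insert(self, p):
        self.buffer.append((p, 1.0))
        if len(self.buffer) == self.block_size:
            summary, self.buffer = self.buffer, []
            self._carry(summary)

    def _carry(self, summary):
        # Works like incrementing a binary counter: two summaries on the
        # same level are merged (union), reduced, and pushed one level up.
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = summary
                return
            merged = self.levels[i] + summary
            self.levels[i] = None
            summary = self.reduce(merged, self.block_size, self.rng)
            i += 1

    def summary(self):
        """Union of the buffer and all stored summaries."""
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out

mr = MergeReduce(block_size=10)
for i in range(1000):
    mr.insert((float(i),))
print(len(mr.summary()), len(mr.levels))
```

Only one summary per level is stored, so the number of stored points grows with the logarithm of the number of blocks. With a real coreset construction, each reduce step adds error, so the per-step accuracy would have to be chosen with the number of levels in mind.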

Slide 28 (Merge & Reduce): Coresets by pre-clustering [Guha, Mishra, Motwani, O'Callaghan, FOCS'00; Har-Peled, Mazumdar, STOC'04; Frahling, S., STOC'05]
Compute a pre-clustering $S$ with more than $k$ centers and $\mathrm{cost}(P,S) \le \varepsilon \cdot \mathrm{Opt}_k$.
Size exponential in $d$.

Slide 29 (Merge & Reduce): Coresets by sampling [Chen, SICOMP'09; Feldman, Monemizadeh, S., SoCG'07]
Compute a random non-uniform sample.
Show that the sample approximates all solutions from a net.
Size polynomial in $d$.

Slide 30 (Merge & Reduce): Coresets by reduction to 1D [Har-Peled, Kushal, DCG'07; Feldman, Fiat, Sharir, FOCS'06]
Use geometric arguments to solve the 1D case.
Combine with pre-clustering using line centers.
For k-median: size independent of $n$ (but exponential in $d$).

Slide 31 (Merge & Reduce): Open problems
Coresets for k-median of size independent of $n$ and $d$? (Partial result in [Feldman, Monemizadeh, S., SoCG'07].)
Coresets for k-median of size $O(d/\varepsilon^2)$?
Coresets for k-median of size $\mathrm{poly}(d, \log n)/\varepsilon^{2-c}$ for a constant $c = c(d) > 0$?
Coresets for j-subspace 1-median of size $\mathrm{poly}(\varepsilon, d, j, \log n)$?
The same questions for the k-means objective function.
Remark: the open questions refer to the definition of coresets from this talk.

Slide 32 (Geometric update streams): Insertion/deletion model
The stream consists of Insert(p) and Delete(p) operations.
Points are from $\{1, \dots, \Delta\}^d$.
The stream is consistent, i.e. there is no Delete(p) if p is not present, and no Insert(p) if p is already present in the current set.
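The consistency condition can be spelled out as a small reference checker. This Python sketch is an assumption-laden illustration: the function name `apply_updates` and the `("insert" | "delete", point)` encoding are invented here. It stores the whole point set, which a small-space streaming algorithm cannot do, so it is useful only for testing.

```python
def apply_updates(updates, delta, d):
    """Replay an update stream over {1,...,delta}^d and return the final
    point set; raise ValueError if the stream is malformed or inconsistent."""
    current = set()
    for op, p in updates:
        if len(p) != d or not all(1 <= x <= delta for x in p):
            raise ValueError(f"point {p} is outside {{1..{delta}}}^{d}")
        if op == "insert":
            if p in current:
                raise ValueError(f"Insert of present point {p}")
            current.add(p)
        elif op == "delete":
            if p not in current:
                raise ValueError(f"Delete of absent point {p}")
            current.remove(p)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return current

stream = [("insert", (1, 2)), ("insert", (3, 3)), ("delete", (1, 2))]
print(apply_updates(stream, delta=4, d=2))  # {(3, 3)}
```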

Slides 33-38 (Embeddings in tree metrics): Streaming algorithms via embeddings into tree metrics
[Figure: points p, q, r, s, t in nested grids and the corresponding tree; tree edges are labeled with lengths $2^i$, $2^{i-1}$, $2^{i-2}$, ….]

Slide 39 (Embeddings in tree metrics): Streaming algorithms via embeddings into tree metrics
The tree defines a tree metric $D(\cdot,\cdot)$ with
$\|p - q\| \le D(p,q)$ and $E[D(p,q)] = O(\log \Delta) \cdot \|p - q\|$
[Bartal, FOCS'96; Charikar, Chekuri, Goel, Guha, Plotkin, FOCS'98].
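One common way to obtain such a tree metric is a randomly shifted hierarchy of grids. The Python sketch below is an assumption, not necessarily the construction used in the talk: the edge from a level-j cell to its parent gets length sqrt(d) * 2^j (the slides label edges with 2^i; the sqrt(d) factor is added here so that domination holds for integer points), and the shift range is chosen arbitrarily. The code checks only the domination property, not the expected stretch.

```python
import math
import random

def random_shift(delta, d, rng):
    """One random integer shift per coordinate (range is an assumption)."""
    return tuple(rng.randrange(delta) for _ in range(d))

def tree_distance(p, q, shift):
    """Distance between integer points p and q in the tree induced by
    nested grids of side length 2^level, shifted by `shift`."""
    if p == q:
        return 0.0
    d = len(p)
    level = 0
    # Find the lowest level at which p and q fall into the same grid cell.
    while any((a + s) >> level != (b + s) >> level
              for a, b, s in zip(p, q, shift)):
        level += 1
    # Path p -> common ancestor -> q uses edges of levels 0..level-1 twice:
    # 2 * sqrt(d) * (2^0 + ... + 2^(level-1)).
    return 2.0 * math.sqrt(d) * (2 ** level - 1)

rng = random.Random(1)
shift = random_shift(64, 2, rng)
p, q = (3, 5), (40, 17)
print(math.dist(p, q), tree_distance(p, q, shift))
```

Two integer points that share a cell of side 2^level differ by at most 2^level - 1 in each coordinate, so their Euclidean distance is at most sqrt(d) * (2^level - 1), which is half of the tree distance computed above.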

Slide 40 (Streaming algorithms via embeddings into tree metrics): Estimator for the cost of the Euclidean minimum spanning tree (EMST) [Indyk, STOC'04]
Write $\mathrm{EMST}$ for the cost of the EMST.
Write $\mathrm{MST}_D$ for the cost of a minimum spanning tree of the tree metric $D$.
$E[\mathrm{MST}_D] = O(\log \Delta) \cdot \mathrm{EMST}$ (linearity of expectation).
Use the cost of the MST of $D$ as the estimator.

Slide 41 (Streaming algorithms via embeddings into tree metrics): Observation [Indyk, STOC'04]
The MST of $D(\cdot,\cdot)$ is given by the tree defining the tree metric.
#edges of length $2^i$ = #non-empty cells in the corresponding grid.
[Figure: points p, q, r, s, t in a grid and the corresponding tree with edges of length $2^i$.]

Slide 42 (Streaming algorithms via embeddings into tree metrics): Euclidean minimum spanning tree
1. Use $O(\log \Delta)$ nested grids $G(i)$ with side length $2^i$.
2. For each grid:
3.   approximate $|G(i)|$ := #non-empty cells in $G(i)$ using an $F_0$ sketch.
4. Return $\sum_i 2^i \cdot |G(i)|$.
Theorem [Indyk, STOC'04]: The above algorithm computes an $O(\log \Delta)$-approximation to the cost of the minimum spanning tree.
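The four steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class name `GridEMSTEstimator`, the number of levels, and the random shift are choices made here, and exact per-cell counters replace the $F_0$ sketches. The counters use space linear in the number of points, so this version shows the estimator's arithmetic and not the small-space guarantee.

```python
import random
from collections import Counter

class GridEMSTEstimator:
    def __init__(self, delta, d, seed=0):
        rng = random.Random(seed)
        # Levels 0..L with 2^L > 2*delta, so the top grid has a single cell.
        self.num_levels = (2 * delta).bit_length() + 1
        self.shift = tuple(rng.randrange(delta) for _ in range(d))
        # cells[i][c] = number of current points in cell c of grid G(i).
        # Exact counters stand in for the F_0 sketches of the talk.
        self.cells = [Counter() for _ in range(self.num_levels)]

    def _cell(self, p, i):
        return tuple((x + s) >> i for x, s in zip(p, self.shift))

    def update(self, p, sign):
        """sign = +1 for Insert(p), -1 for Delete(p)."""
        for i, counter in enumerate(self.cells):
            c = self._cell(p, i)
            counter[c] += sign
            if counter[c] == 0:
                del counter[c]   # the cell became empty

    def estimate(self):
        """Sum over all grids of 2^i times the number of non-empty cells."""
        return sum((2 ** i) * len(counter)
                   for i, counter in enumerate(self.cells))

est = GridEMSTEstimator(delta=16, d=2)
for p in [(1, 1), (2, 1), (15, 16), (8, 8)]:
    est.update(p, +1)
est.update((8, 8), -1)
print(est.estimate())
```

Because every update touches one cell per level, insertions and deletions are handled symmetrically, which is what makes the approach suitable for update streams.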

Slide 43 (Streaming algorithms via embeddings into tree metrics): Results using a similar approach [Indyk, STOC'04]
Problem                  Approx. factor
Earth mover's distance   $O(\log \Delta)$
Facility location        $O(\log^2 \Delta)$
Matching                 $O(\log \Delta)$
k-Median                 $O(1)$; $1+\varepsilon$ with huge extraction time

Slide 44 (Distribution of neighborhoods): Streaming algorithms via estimating the distribution of local neighborhoods
Grids $G(i)$ as before.
R-neighborhood of a cell $C$: the cells within distance at most $R$ from $C$.
$m_{C,R}(i)$ is the number of points in the i-th cell of the R-neighborhood of $C$.
[Figure: a cell and its 2-neighborhood.]

Slide 45 (Streaming algorithms via estimating the distribution of local neighborhoods): EMST estimator
Define $Z_{C,R}(i) = (m_{C,R}(i) > 0)$.
The EMST cost can be approximated from the $Z_{C,R}(i)$.
The approximation ratio goes to 1 as $R$ goes to $\infty$.

Slide 46 (Streaming algorithms via estimating the distribution of local neighborhoods): EMST estimator
$K$: size of the R-neighborhood.
The $Z_{C,R}$ are functions from $\{1, \dots, K\}$ to $\{0,1\}$.
A random (non-empty) cell $C$ defines a distribution over neighborhoods, i.e. over functions $Z: \{1, \dots, K\} \to \{0,1\}$.
The EMST cost can still be estimated from this distribution.

Slide 47 (Streaming algorithms via estimating the distribution of local neighborhoods): Algorithm
Sample a certain number of non-empty grid cells and maintain, for each sampled cell, the number of points in each cell of its neighborhood.
The sample gives an estimate of the distribution of the $Z_{C,R}(\cdot)$.
Obtain an estimate of the EMST cost from the estimated distribution.
Theorem [Frahling, Indyk, S., IJCGA'07]: Let $\varepsilon > 0$ and $d$ be constants. The cost of a Euclidean minimum spanning tree of a point set in $\mathbb{R}^d$ given as an update stream can be estimated within a factor of $1 \pm \varepsilon$ using $\mathrm{polylog}(\Delta)$ space.

Slide 48 (Streaming algorithms via estimating the distribution of local neighborhoods): Open problems
$(1+\varepsilon)$-approximation for matching and/or earth mover's distance.
Other problems? The approach is not very well understood.
General characterization of problems solvable via approximation of the distribution of local neighborhoods.

Slide 49 (Streaming algorithms via balanced partitions): Estimating the distribution [Frahling, S., STOC'05]
Divide space into regions.
For each region, maintain the number of points inside.
Balance the "error" among regions; the notion of error depends on the problem.
Example, 1-median in 1D: error $\le$ cell width $\cdot$ #points in cell.
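The 1D example can be checked numerically. In the Python sketch below, every point is replaced by the center of its cell and only per-cell counts are kept; the uniform cell width and all function names are assumptions made for illustration (a balanced partition would use smaller cells where many points lie). Each point moves by at most half a cell width, so the change in 1-median cost for any center is at most cell width times the number of points.

```python
import math
from collections import Counter

def one_median_cost(points, c):
    """Cost of center c for unweighted points on the line."""
    return sum(abs(x - c) for x in points)

def snap_to_cells(points, width):
    """Keep only (cell center, number of points in the cell) pairs."""
    counts = Counter(math.floor(x / width) for x in points)
    return [((idx + 0.5) * width, n) for idx, n in counts.items()]

def weighted_cost(cells, c):
    """Cost of center c computed from the per-cell counts alone."""
    return sum(n * abs(rep - c) for rep, n in cells)

points = [0.2, 0.9, 1.1, 3.7, 3.8, 7.25]
width = 1.0
cells = snap_to_cells(points, width)
for c in (0.0, 2.5, 4.0):
    error = abs(one_median_cost(points, c) - weighted_cost(cells, c))
    print(c, error, "bound:", width * len(points))
```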

Slide 50 (Streaming algorithms via balanced partitions): Small space?
Problem dependent.
Need to show that a decomposition into few regions with sufficiently small error exists.

Slide 51 (Streaming algorithms via balanced partitions): One approach [Frahling, S., STOC'05]
Nested grids $G(i)$.
For each grid, maintain the cells intersected by a random sample (sample sizes differ for different grids).
#sample points inside a cell -> #points inside the cell.
Combine cells from different grids into a space decomposition.

Slide 52 (Streaming algorithms via balanced partitions): Works for
k-median, k-means
MaxTSP, MaxMatching, maximum spanning tree, average distance, MaxCut
Why?
k-median and k-means require a proof.
The last 5 problems can be reduced to 1-median.

Slide 53 (Streaming algorithms via approximation of balanced partitions): Approximating properties of balanced partitions [Lammersen, S., ESA'08]
The previous approach may lead to many regions. Example: facility location.
One can approximate properties of balanced partitions, e.g. the number of regions.
This only gives an approximation of the cost of a solution.
More details in Christiane's talk.

Slide 54 (Streaming algorithms via balanced partitions): Open problems
Min-sum-k-clustering
Other problems?

Slide 55 (Summary): (Some) techniques in geometric streaming
Merge & Reduce
Embeddings into tree metrics
Estimation of the distribution of local neighborhoods
Balanced partitions
Approximating properties of balanced partitions
And lots of open problems to work on…

Slide 56: Thank you!