
1 Special Topics in Data Engineering Panagiotis Karras CS6234 Lecture, March 4th, 2009

2 Outline Summarizing Data Streams. Efficient Array Partitioning. 1D Case. 2D Case. Hierarchical Synopses with Optimal Error Guarantees.

3 Summarizing Data Streams Approximate a sequence [d_1, d_2, …, d_n] with B buckets s_i = [b_i, e_i, v_i], so that an error metric is minimized. Data arrive as a stream: seen only once, cannot be stored. Objective functions: maximum absolute error, max_i |d_i − v_{j(i)}|, and Euclidean error, sqrt( Σ_i (d_i − v_{j(i)})² ), where j(i) is the bucket covering position i.

4 Histograms [KSM 2007] Solve the error-bounded problem. Maximum absolute error bound ε = 2. Example input: 4 5 6 2 15 17 3 6 9 12 … yields buckets [ 4 ][ 16 ][ 4.5 ][… The approach generalizes to any weighted maximum-error metric. Each value d_i defines a tolerance interval [d_i − ε, d_i + ε]; a bucket is closed when the running intersection of these intervals becomes empty. Complexity: O(n), in a single pass (see the sketch below).
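A minimal sketch of this one-pass greedy in Python (function and variable names are illustrative, and only the unweighted maximum-absolute-error case is shown, not the weighted generalization):

```python
def error_bounded_histogram(data, eps):
    """One-pass greedy bucketing under a max-absolute-error bound eps.

    Each value d tolerates any bucket value in [d - eps, d + eps]; the
    current bucket is closed as soon as the running intersection of
    these tolerance intervals becomes empty. O(n) time.
    """
    buckets = []                            # (start, end, value) triples
    start, lo, hi = 0, float("-inf"), float("inf")
    for i, d in enumerate(data):
        new_lo, new_hi = max(lo, d - eps), min(hi, d + eps)
        if new_lo > new_hi:                 # intersection became empty
            buckets.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, d - eps, d + eps
        else:
            lo, hi = new_lo, new_hi
    buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

# With eps = 2, [4, 5, 6, 2, 15, 17, 3, 6, 9, 12] yields bucket values
# 4, 16, 4.5, ... exactly as on the slide.
```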

5 Histograms Apply to the space-bounded problem. Perform binary search in the domain of the error bound ε, invoking the error-bounded algorithm at each step (sketched below). For an error value ε that fits within the space budget, with actual achieved error ε* ≤ ε, run an optimality test: run the error-bounded algorithm under a constraint just below ε* instead of ε; if that requires more than B buckets, then the optimal solution has been reached. The number of search steps is independent of the number of buckets B. What about the streaming case?
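A sketch of the reduction, under simplifying assumptions: the paper searches a discrete domain of candidate error values and applies the optimality test above, whereas this version binary-searches a real-valued ε down to a tolerance and omits the test:

```python
def space_bounded_histogram(data, B, tol=1e-9):
    """Binary-search the smallest error bound eps whose greedy,
    error-bounded run fits within B buckets (simplified sketch)."""
    lo, hi = 0.0, max(data) - min(data)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if len(error_bounded_histogram(data, mid)) <= B:
            hi = mid                # feasible: try a smaller bound
        else:
            lo = mid                # infeasible: raise the bound
    return error_bounded_histogram(data, hi)
```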

6 Streamstrapping [Guha 2009] Run multiple algorithm instances with different error estimates. The error metric satisfies a composition property: re-summarizing a summary adds at most the summary's own error to the total.
1. Read the first B items; keep reading until the first nonzero error estimate (> 1/M).
2. Start versions for a geometric ladder of error estimates.
3. When the version for some estimate fails, (a) terminate all versions for smaller estimates, (b) start new versions for larger estimates, using the summary of the failed version as their first input.
4. Repeat until the end of the input.
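A single-instance sketch of the idea (the paper runs several versions in parallel with its own initialization and bookkeeping; eps0, delta, and the replay-by-bucket rule here are assumptions of the sketch, not the paper's exact scheme):

```python
def streamstrap(stream, B, delta, eps0=1e-3):
    """Run the greedy error-bounded summarizer with guess eps; when the
    guess fails (more than B buckets needed), raise it by a (1 + delta)
    factor and re-summarize the current summary, replayed as input,
    before continuing with the stream."""
    eps, buckets, cur, i = eps0, [], None, 0   # cur = [start, lo, hi]

    def push(d, idx):
        nonlocal cur
        if cur is None:
            cur = [idx, d - eps, d + eps]
            return
        lo, hi = max(cur[1], d - eps), min(cur[2], d + eps)
        if lo > hi:                            # close the open bucket
            buckets.append((cur[0], idx - 1, (cur[1] + cur[2]) / 2))
            cur = [idx, d - eps, d + eps]
        else:
            cur[1], cur[2] = lo, hi

    for d in stream:
        push(d, i)
        i += 1
        while len(buckets) + 1 > B:            # the guess eps has failed
            eps *= 1 + delta                   # raise the error estimate
            old = buckets + [(cur[0], i - 1, (cur[1] + cur[2]) / 2)]
            buckets, cur = [], None
            for (s, e, v) in old:              # replay summary as input;
                push(v, s)                     # ends are implied by next start
    if cur is not None:
        buckets.append((cur[0], i - 1, (cur[1] + cur[2]) / 2))
    return buckets, eps
```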

7 Streamstrapping [Guha 2009] Theorem: for any δ > 0, the StreamStrap algorithm achieves a (1 + O(δ))-approximation, running a bounded number of copies and initializations. Proof: Consider the lowest error estimate for which an algorithm instance runs. Suppose the error estimate was raised j times before reaching it. X_i: the prefix of the input just before the error estimate was raised for the i-th time. Y_i: the suffix between the (i−1)-th and i-th raising of the error estimate. H_i: the summary built for X_i. The error on each prefix then decomposes recursively into the target error plus the added error of re-summarizing H_i. The error estimate is raised by a (1 + δ) factor every time.

8 Streamstrapping [Guha 2009] Proof (cont'd): Putting it all together and telescoping, the total error is the target error plus a sum of added errors that is geometric, since consecutive estimates differ by a (1 + δ) factor. Moreover, the estimate just below the final one was insufficient (the algorithm failed for it), so the optimal error exceeds it. Thus the sum of added errors is dominated by the optimal error, and the total error is within the claimed factor of the optimum. The bound on the number of initializations follows.

9 Streamstrapping [Guha 2009] Theorem: bounds on the algorithm's space and running time. The space bound follows from the number of copies. Batch the input values in groups of t. Define a binary tree over the t values and compute the min and max at each tree node; using the tree, the max and min of any interval are computed in O(log t) (see the sketch below). Every copy has to check violation of its error bound over the t items: non-violation is decided in O(1), and a violation is located in O(log t). Summing over all buckets, and then over all algorithm instances, yields the time bound.
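The interval min/max structure can be realized as a standard segment tree; a generic sketch of that textbook machinery, not the paper's exact layout:

```python
class MinMaxTree:
    """Binary tree over t batched values with (min, max) per node, so
    the min and max of any interval are answered in O(log t)."""

    def __init__(self, values):
        self.t = len(values)
        size = 1
        while size < self.t:
            size *= 2
        self.size = size
        pad = size - self.t
        self.mn = [float("inf")] * size + values + [float("inf")] * pad
        self.mx = [float("-inf")] * size + values + [float("-inf")] * pad
        for v in range(size - 1, 0, -1):       # fill internal nodes
            self.mn[v] = min(self.mn[2 * v], self.mn[2 * v + 1])
            self.mx[v] = max(self.mx[2 * v], self.mx[2 * v + 1])

    def query(self, l, r):
        """(min, max) of values[l..r], inclusive, in O(log t)."""
        lo, hi = float("inf"), float("-inf")
        l += self.size
        r += self.size + 1
        while l < r:
            if l & 1:
                lo, hi = min(lo, self.mn[l]), max(hi, self.mx[l])
                l += 1
            if r & 1:
                r -= 1
                lo, hi = min(lo, self.mn[r]), max(hi, self.mx[r])
            l //= 2
            r //= 2
        return lo, hi
```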

10 1D Array Partitioning [KMS 1997] Problem: Partition an array of n items into p intervals so that the maximum weight of the intervals is minimized. Arises in load balancing in pipelined, parallel environments.

11 1D Array Partitioning [KMS 1997] Idea: Perform binary search over all O(n^2) possible intervals that could be responsible for the maximum weight (bottlenecks). Obstacle: an approximate median of these interval weights has to be calculated in O(n) time.

12 1D Array Partitioning [KMS 1997] Solution: Exploit the internal structure of the O(n^2) intervals. Arrange them in n columns, column c consisting of the intervals [c, c], [c, c+1], …, [c, n]; ordered by decreasing right endpoint, a column's weights are monotonically non-increasing.

13 1D Array Partitioning [KMS 1997] Calls to the interval-weight oracle F(...) need O(1). (why? precomputed prefix sums.) The median of any subcolumn is determined with one call to the F oracle. (how? column weights are monotone, so the median is the middle entry; see the sketch below.) Splitter-finding Algorithm: Find the median weight in each active subcolumn. Find the median of medians m in O(n) (standard). C_l (C_r): the set of columns with median less than (greater than) m.
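A sketch of the F oracle and the one-call subcolumn median (names are illustrative; weights are assumed nonnegative, which makes column weights monotone):

```python
from itertools import accumulate

def build_prefix(a):
    """P[i] = a[0] + ... + a[i-1], so any interval weight is a single
    O(1) subtraction."""
    return list(accumulate(a, initial=0))

def F(P, i, j):
    """Weight of the interval a[i..j] in O(1)."""
    return P[j + 1] - P[i]

def subcolumn_median(P, c, lo, hi):
    """Median weight of the subcolumn of intervals [c..lo], ..., [c..hi]:
    the weights are monotone in the right endpoint, so the median is
    the middle entry, i.e. one call to F."""
    return F(P, c, (lo + hi) // 2)
```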

14 1D Array Partitioning [KMS 1997] The median of medians m is not always a splitter.

15 1D Array Partitioning [KMS 1997] If the median of medians m is not a splitter, recur on the set of active subcolumns (C_l or C_r) with more elements (ignored elements are still counted in future set-size calculations). Otherwise, return m as a good splitter (approximate median). End of Splitter-finding Algorithm.

16 1D Array Partitioning [KMS 1997] Overall Algorithm:
1. Arrange the intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how? greedy scan; see the sketch below.)
4. If it is, m is an upper bound on the optimal maximum weight: eliminate half of the elements of each subcolumn in C_l; otherwise, in C_r.
5. Recur until convergence to the optimal m.
Complexity: O(n log n)
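The partitionability check in step 3 is a standard greedy scan; a sketch, assuming nonnegative weights:

```python
def partitionable(a, p, m):
    """Can array a be cut into at most p intervals, each of weight <= m?
    Greedy: extend the current interval while its weight stays <= m,
    cut otherwise. O(n)."""
    intervals, weight = 1, 0
    for x in a:
        if x > m:
            return False          # a single item already exceeds m
        if weight + x > m:
            intervals += 1        # cut before x, start a new interval
            weight = x
        else:
            weight += x
    return intervals <= p
```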

17 2D Array Partitioning [KMS 1997] Problem: Partition a 2D array of n × n items into a p × p partition (inducing p² blocks) so that the maximum weight of the blocks is minimized. Arises in particle-in-cell computations, sparse matrix computations, etc. NP-hard [GM 1996]; APX-hard [CCM 1996].

18 2D Array Partitioning [KMS 1997] Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis. Observation 1: If an array has a p × p partition of maximum block weight W, then it may contain at most 2(p − 1) independent rectangles of weight strictly greater than W. (why?)

19 2D Array Partitioning [KMS 1997] A rectangle of weight > W cannot fit inside a single block, so at least one of the 2(p − 1) partition lines is needed to stab each of the independent rectangles; since their projections are disjoint, each line stabs at most one. Best case: exactly 2(p − 1) independent rectangles.

20 2D Array Partitioning [KMS 1997] The Algorithm: Assume we know the optimal W. Step 1: (define P) Given W, obtain a partition P such that each row/column within any block has weight at most 2W. (how?) Independent horizontal and vertical scans, keeping track of the running sum of weights of each row/column in the current block, cutting just before a sum would overflow (sketched below). (why does such a P exist?)
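One hedged reading of the Step-1 vertical scan (the horizontal scan is the same on the transpose; names are illustrative, and the cut rule assumes no single cell already exceeds the 2W budget):

```python
def greedy_cuts(A, W):
    """Vertical scan over an n x n array A: cut just before any row's
    running weight within the current slab would exceed 2W."""
    n = len(A)
    cuts, run = [], [0] * n      # run[r] = row r's weight in current slab
    for c in range(n):
        col = [A[r][c] for r in range(n)]
        if any(run[r] + col[r] > 2 * W for r in range(n)):
            cuts.append(c)       # cut before column c, start a new slab
            run = col[:]
        else:
            run = [run[r] + col[r] for r in range(n)]
    return cuts
```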

21 2D Array Partitioning [KMS 1997] Step 2: (from P to S) Construct the set S of all minimal rectangles of weight more than W that are entirely contained in blocks of P. (how?) Start from each location within a block and consider all possible rectangles in order of increasing sides until W is exceeded, keeping the minimal ones (see the sketch below). Property of S: each rectangle weighs at most 3W. (why? Hint: rows/columns in blocks of P weigh at most 2W.)
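A brute-force sketch of Step 2, for clarity only; the paper's construction is the one that meets its stated bounds:

```python
def heavy_minimal_rects(block, W):
    """All minimal rectangles of weight > W inside one block. From each
    top-left corner, grow rectangles in order of increasing width (per
    height) and record the first heavy one; a final pass discards any
    recorded rectangle that contains another recorded one."""
    rows, cols = len(block), len(block[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            for h in range(1, rows - r + 1):
                for w in range(1, cols - c + 1):
                    wt = sum(block[i][j]
                             for i in range(r, r + h)
                             for j in range(c, c + w))
                    if wt > W:
                        found.append((r, c, h, w))
                        break      # wider ones cannot be minimal

    def contains(a, b):            # does rect a properly contain rect b?
        return (a != b and a[0] <= b[0] and a[1] <= b[1] and
                a[0] + a[2] >= b[0] + b[2] and a[1] + a[3] >= b[1] + b[3])

    return [a for a in found if not any(contains(a, b) for b in found)]
```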

22 2D Array Partitioning [KMS 1997] Step 3: (from S to M) Determine a locally 3-optimal set M of independent rectangles from S. 3-optimality: there does not exist a set of 3 independent rectangles in S that, added to M after removing 2 rectangles from it, leaves the independence condition intact. Polynomial-time construction. (how? with swaps: local optimality is easy to reach.)

23 2D Array Partitioning [KMS 1997] Step 4: (from M to a new partition) For each rectangle in M, add the two straddling horizontal and the two straddling vertical lines that induce it. At most 2|M| horizontal and 2|M| vertical lines are derived. New partition: P from Step 1 together with these lines.

24 2D Array Partitioning [KMS 1997] Step 5: (final) Retain only every c-th horizontal line and every c-th vertical line, for a suitable constant c, so that at most p − 1 lines survive in each direction. Merging neighboring blocks this way increases the maximum weight by at most a constant factor.

25 2D Array Partitioning [KMS 1997] Analysis: We have to show that: (a) given a W (large enough) such that a p × p partition of maximum block weight W exists, the maximum block weight in the constructed partition is O(W); (b) the minimum W for which the analysis holds (found by binary search) does not exceed the optimum W.

26 2D Array Partitioning [KMS 1997] Lemma 1: (at Step 1) Let b be a block contained in partition P. If b exceeds weight 27W, then b can be partitioned into 3 independent rectangles of weight > W each. Proof: Vertical scan in b; cut as soon as the slab weight seen exceeds 7W. (hence each slab weighs < 9W: why? columns within b weigh at most 2W.) Horizontal scan; cut as soon as a seen slab weight exceeds W.

27 2D Array Partitioning [KMS 1997] Proof (cont'd): A slab whose weight exceeds W does not exceed 3W. (why? rows within blocks of P weigh at most 2W.) Eventually, we obtain 3 independent rectangles weighing > W each.

28 2D Array Partitioning [KMS 1997] Lemma 2: (at Step 4) The weight of any block b of the Step-4 partition is O(W). Proof: Case 1: the weight of b is O(W) directly (recall that each rectangle in S weighs < 3W). Case 2: the weight of b is < 27W: if it exceeded 27W, then b would be partitionable into 3 independent rectangles (Lemma 1), which could substitute the at most 2 rectangles in M that are not independent of b; this violates the 3-optimality of M.

29 2D Array Partitioning [KMS 1997] Lemma 3: (at Step 3) If a p × p partition of maximum block weight W exists, then |M| ≤ 2(p − 1). Proof: The weight of every rectangle in M is > W. By Observation 1, at most 2(p − 1) such independent rectangles can be contained in the array, hence in M.

30 2D Array Partitioning [KMS 1997] Lemma 4: (at Step 5) If |M| ≤ 2(p − 1), the weight of any block in the final solution is O(W). Proof: At Step 5, the maximum weight increases by at most a constant factor; by Lemma 2, the maximum weight before was O(W); hence the final weight is O(W). This proves (a). (b) The least W for which Step 1 and Step 3 succeed does not exceed the optimum W; it is found by binary search.

31 Compact Hierarchical Histograms Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized. Heuristic solutions: Reiss et al., VLDB 2006. (Figure: a CHH tree of coefficients c_0, …, c_6 over data d_0, …, d_3.) The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node, and also on whether node C is a bucket node [Reiss et al., VLDB 2006].

32 Compact Hierarchical Histograms Solve the error-bounded problem. Next-to-bottom-level case. (Figure: a node c_i with children c_{2i} and c_{2i+1}.)

33 Compact Hierarchical Histograms Solve the error-bounded problem: the general, recursive case, with a space-efficient construction. Apply it to the space-bounded problem. Complexity: polynomially tractable. (A sketch of the intuition follows.)
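As a rough illustration of the error-bounded recursion only: the sketch below forces each subtree either to be covered by a single bucket value or to be split between its children, whereas a real CHH lets a deeper bucket override an ancestor's value; the paper's algorithm handles that general case.

```python
def chh_buckets_needed(data, eps):
    """Count buckets in a simplified hierarchical model (data assumed
    non-empty): a node can act as the single bucket for its subtree
    iff the subtree's values fit in a window of width 2 * eps;
    otherwise, recurse on the two halves."""
    def rec(lo, hi):                   # half-open range [lo, hi)
        mn, mx = min(data[lo:hi]), max(data[lo:hi])
        if mx - mn <= 2 * eps:
            return 1                   # one bucket value serves the subtree
        mid = (lo + hi) // 2
        return rec(lo, mid) + rec(mid, hi)
    return rec(0, len(data))
```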

34 References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting duality in summarization with deterministic guarantees. KDD 2007.
2. S. Guha: Tight results for clustering and summarizing data streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient array partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact histograms for hierarchical identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical synopses with optimal error guarantees. ACM TODS 33(3), 2008.

35 Thank you! Questions?

