Multiple Aggregations over Data Streams. Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava. SIGMOD 2005. Slides dated 2006/3/21.



Outline
- Introduction to the Giga-Scope DSMS
- The multiple aggregations problem
- The proposed approach
  - choice of phantoms
  - space allocation problem
- Conclusion

Giga-Scope
A DSMS for monitoring high-speed IP traffic data.
- LFTA (network interface card): simple low-level queries over the high-speed data stream, which serve to reduce data volumes.
- HFTA (main memory): processes the low-speed data stream fed by the LFTAs.

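Single Aggregation in Giga-Scope
Query: Select A, count(*) From R Group by A;
[Figure: records of R update an LFTA hash table of (group, count) entries, which feeds the HFTA. Entries shown: (2,1), (24,1), (3,1), (17,1), (2,2), (2,3), (4,1).]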

Cost of Processing a Single Aggregation
- probe (c1): the cost of looking up the hash table in the LFTAs, with a possible update in case of a collision
- eviction (c2): the cost of transferring an entry from the LFTAs to the HFTAs
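The two cost terms above can be made concrete with a small simulation. This is a minimal sketch, not Giga-Scope code: the class name `LftaTable`, the one-entry-per-bucket layout, and the use of Python's built-in `hash` are all assumptions made for illustration.

```python
class LftaTable:
    """Toy LFTA hash table: one (group, count) entry per bucket (assumed layout)."""

    def __init__(self, num_buckets):
        self.buckets = [None] * num_buckets
        self.probes = 0      # number of c1 charges
        self.evictions = 0   # number of c2 charges
        self.hfta = {}       # counts accumulated on the HFTA side

    def _evict(self, entry):
        # Transfer one entry to the HFTA, which merges partial counts.
        group, count = entry
        self.evictions += 1
        self.hfta[group] = self.hfta.get(group, 0) + count

    def insert(self, group):
        # Every record costs one probe.
        self.probes += 1
        i = hash(group) % len(self.buckets)
        entry = self.buckets[i]
        if entry is None:
            self.buckets[i] = (group, 1)
        elif entry[0] == group:
            self.buckets[i] = (group, entry[1] + 1)
        else:
            # Collision: the resident entry is evicted and replaced.
            self._evict(entry)
            self.buckets[i] = (group, 1)

    def end_epoch(self):
        # At the end of an epoch every remaining entry is flushed.
        for i, entry in enumerate(self.buckets):
            if entry is not None:
                self._evict(entry)
                self.buckets[i] = None


table = LftaTable(num_buckets=3)
for a in [2, 24, 2, 4]:   # illustrative group values
    table.insert(a)
table.end_epoch()
```

The total cost of the run is then `table.probes * c1 + table.evictions * c2` for whatever unit costs are chosen.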

Processing Multiple Aggregations Naively
Queries:
Select A, count(*) From R Group by A;
Select B, count(*) From R Group by B;
Select C, count(*) From R Group by C;
Stream records of R(A, B, C): (2, 3, 4), (24, 4, 3), (2, 3, 4), (4, 2, 3).
[Figure: the LFTA keeps one hash table per query (Hash Table A, Hash Table B, Hash Table C), each holding (group, count) entries that are transferred to the HFTA. At the end of the epoch the cost shown is 15c1 + 1c2 + 7c2.]

Processing Multiple Aggregations by Maintaining Phantoms
Stream records of R(A, B, C): (2, 3, 4), (24, 4, 3), (2, 3, 4), (4, 2, 3).
[Figure: the LFTA keeps a phantom Hash Table ABC in front of Hash Table A, Hash Table B and Hash Table C. Records update entries of Hash Table ABC such as (2, 3, 4, 1), (24, 4, 3, 1) and (4, 2, 3, 1); the query tables feed the HFTA. At the end of the epoch the cost shown is 14c1 + 8c2.]

The Problem
Consider a set of aggregation queries over a data stream that differ only in their grouping attributes. Determine an optimal sharing configuration for the queries under limited memory.
Given the queries, two decisions have to be made:
- choice of phantoms
- space allocation
[Figure: lattice with the queries Q1 = AB, Q2 = BC, Q3 = BD, Q4 = CD at the bottom, ABC, ABD and BCD above them, and ABCD at the top.]

Idea of Maintaining Phantoms
- $x_1$: the collision rate without phantoms
- $x_1'$: the collision rate with phantoms
- $x_2$: the collision rate of phantom ABC
The total cost:
- Without the phantom: $E_1 = 3nc_1 + 3x_1nc_2$
- With the phantom: $E_2 = nc_1 + 3x_2nc_1 + 3x_1'x_2nc_2$

Example
[Figure: without the phantom, hash tables A, B and C each get M/3 of the memory; with the phantom ABC, the tables get M/4. Probes are labeled c1 and evictions c2.]
To be fair, the total space used for the hash tables should be the same with or without the phantom.
Dividing the costs by n:
$E_1 = 3c_1 + 3x_1c_2$
$E_2 = c_1 + 3x_2c_1 + 3x_1'x_2c_2$
$E_1 - E_2 = (2 - 3x_2)c_1 + 3(x_1 - x_1'x_2)c_2 = F(x_1, x_2, x_1')$
When $x_2 \to 0$, the phantom reduces the cost.
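The cost comparison above can be checked numerically. The sketch below simply evaluates the two cost formulas from the slides for three queries sharing one phantom; the function names and the sample values of the collision rates and unit costs are hypothetical.

```python
def cost_without_phantom(n, c1, c2, x1):
    # E1 = 3*n*c1 + 3*x1*n*c2: each record probes all three query tables.
    return 3 * n * c1 + 3 * x1 * n * c2


def cost_with_phantom(n, c1, c2, x1_prime, x2):
    # E2 = n*c1 + 3*x2*n*c1 + 3*x1'*x2*n*c2: each record probes the phantom;
    # only the fraction x2 evicted from the phantom reaches the query tables.
    return n * c1 + 3 * x2 * n * c1 + 3 * x1_prime * x2 * n * c2


def benefit(n, c1, c2, x1, x1_prime, x2):
    # Positive when maintaining the phantom is cheaper.
    return cost_without_phantom(n, c1, c2, x1) - cost_with_phantom(n, c1, c2, x1_prime, x2)


# Hypothetical values: eviction is much more expensive than a probe.
gain = benefit(n=1, c1=1.0, c2=50.0, x1=0.3, x1_prime=0.4, x2=0.1)
closed_form = (2 - 3 * 0.1) * 1.0 + 3 * (0.3 - 0.4 * 0.1) * 50.0
```

With these sample values the phantom pays off even though the query tables get less space (a higher $x_1'$), because few records survive past the phantom.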

Collision Rate Estimation (key point)
- g: number of groups of a relation (example: g = 3000)
- b: number of buckets in the hash table (example: b = 1000)
- The probability of k groups out of g being hashed to a bucket
- $B_k$: the number of buckets having k groups
- $n_{rg}$: the expected number of records for each group
- $(1 - 1/k)$: the collision rate in a bucket holding k groups
- Collisions happen only in buckets holding more than one group.
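The slide's formulas did not survive in the transcript, so the following is a reconstruction from the listed definitions rather than the authors' exact expression. It assumes uniform random hashing (so the number of groups per bucket is binomial) and equally frequent groups (so $n_{rg}$ cancels); the function names are hypothetical.

```python
def bucket_occupancy_probs(g, b):
    """P(k groups fall in one given bucket), k = 0..g, under uniform hashing.

    Computed with the binomial recurrence to avoid huge binomial coefficients.
    Requires b >= 2.
    """
    p = (1.0 - 1.0 / b) ** g
    probs = [p]
    ratio = (1.0 / b) / (1.0 - 1.0 / b)
    for k in range(1, g + 1):
        p *= (g - k + 1) / k * ratio
        probs.append(p)
    return probs


def collision_rate(g, b):
    """Expected fraction of records that collide (assumed model, see lead-in)."""
    probs = bucket_occupancy_probs(g, b)
    # B_k = b * P(k) buckets hold k groups; such a bucket receives k * n_rg
    # records, of which a fraction (1 - 1/k) collide. Dividing by the total
    # g * n_rg records cancels n_rg.
    collided = sum(b * probs[k] * k * (1.0 - 1.0 / k) for k in range(2, g + 1))
    return collided / g


x = collision_rate(3000, 1000)   # the g and b values shown on the slide
```

Under these assumptions the sum also has the closed form $1 - \frac{b}{g}\left(1 - (1 - 1/b)^g\right)$, which gives a quick consistency check.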

Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Collision Rate
Benefit = the difference between the maintenance costs without and with the phantom.
- Initially, the configuration I contains only the queries.
- Calculate the maintenance cost if a phantom R is added to I.
- Comparing it with the maintenance cost when R is not in I gives the benefit of R.
- After adding this phantom to I, iterate over the other phantoms.
- As more phantoms are added to I, the overall collision rate goes up and the benefit decreases.
- Stop when the benefit becomes negative.

Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Collision Rate (example)
[Figure: lattice of Q1 = AB, Q2 = BC, Q3 = BD, Q4 = CD, then ABC, ABD, BCD, then ABCD, annotated with the group counts g = 2837, 2117, 1846, 2387, 2249, 1946, 1899, 1999.]
Available memory = 12000, divided by linear proportional allocation.
Queries only:
- Allocate AB = (1846/7690) × available memory
- Allocate BC …
- Allocate BD …
- Allocate CD …
Try ABCD:
- Allocate ABCD = (2837/10527) × 12000
- Allocate AB = (1846/10527) × 12000
- Allocate BC …
- Allocate BD …
- Allocate CD …
$b_{ABCD} \to x_{ABCD} \to$ Benefit, with $E_1 - E_2 = F(x_1, x_2, x_1')$.
The process ends when the benefit becomes negative.
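The linear proportional allocation used in this example can be sketched in a few lines. The function name is hypothetical, and the pairing of the group counts 1946, 1899 and 1999 with BC, BD and CD is an assumption (the transcript only pairs 1846 with AB and 2837 with ABCD, and the four query counts sum to 7690).

```python
def linear_proportional_allocation(groups, memory):
    """Give each table a share of memory proportional to its number of groups."""
    total = sum(groups.values())
    return {name: memory * g / total for name, g in groups.items()}


# Group counts from the slide; the BC/BD/CD pairing is assumed.
queries = {"AB": 1846, "BC": 1946, "BD": 1899, "CD": 1999}
with_abcd = dict(queries, ABCD=2837)

alloc_queries = linear_proportional_allocation(queries, 12000)
alloc_with_abcd = linear_proportional_allocation(with_abcd, 12000)
```

Adding ABCD shrinks every query's share (the denominator grows from 7690 to 10527), which is why the collision rates and the benefit have to be re-evaluated after each candidate is tried.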

Space Allocation
Optimal solution for the two-level graph.
[Figure: two-level graph over AB, A and B, with collision rates x0, x1, x2.]
Set the partial derivatives of the cost e to 0; at the resulting allocation, e has its minimum. Thereby, the space allocated is proportional to the square root of the number of groups.
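The slide's own derivation is not preserved, so the following is only an illustrative derivation of how a square-root rule can arise. It assumes a simplified setting that is not necessarily the slide's two-level case: tables $i = 1, \dots, m$ with $g_i$ groups and $b_i$ buckets, a collision rate approximated as linear in $g_i / b_i$ with hypothetical constants $\alpha$ and $\mu$, equal weight for every table, and a memory budget $M$.

```latex
\begin{align*}
  &\text{minimize } e = \sum_{i=1}^{m} x_i
     \quad\text{with } x_i \approx \alpha + \mu \frac{g_i}{b_i},
     \quad\text{subject to } \sum_{i=1}^{m} b_i = M \\
  &\mathcal{L} = \sum_{i=1}^{m} \left( \alpha + \mu \frac{g_i}{b_i} \right)
     + \lambda \left( \sum_{i=1}^{m} b_i - M \right) \\
  &\frac{\partial \mathcal{L}}{\partial b_i}
     = -\mu \frac{g_i}{b_i^{2}} + \lambda = 0
     \;\Longrightarrow\; b_i = \sqrt{\frac{\mu\, g_i}{\lambda}} \propto \sqrt{g_i} \\
  &\sum_{i=1}^{m} b_i = M
     \;\Longrightarrow\; b_i = M \, \frac{\sqrt{g_i}}{\sum_{j=1}^{m} \sqrt{g_j}}
\end{align*}
```

Under these assumptions each table's share of the memory is proportional to the square root of its number of groups, matching the conclusion stated on the slide.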

Algorithmic Strategies for Choosing the Phantoms
One way of allocating hash table space to a relation is in proportion to the number of groups in the table. We can allocate to a relation with g groups a space of g times a constant, and we set the constant large.

Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Space
- Calculate the benefit of each phantom according to the cost model.
- Calculate the benefit per unit space for each phantom R, i.e. its benefit divided by the space allocated to R.
- Choose the phantom with the largest benefit per unit space as the first phantom to instantiate.
- The process ends when the benefit per unit space becomes negative.
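The steps above can be sketched as a generic greedy loop. This is an illustration under stated assumptions, not the authors' implementation: the cost model is passed in as a hypothetical `benefit(config, candidate)` callable, each table's space is `scale` times its group count, `memory` is the space left for phantoms, and the toy benefit values and the pairing of group counts with ABC and BCD are invented for the example.

```python
def greedy_by_space(candidates, groups, scale, memory, benefit):
    """Repeatedly instantiate the phantom with the largest benefit per unit space."""
    chosen = []
    remaining = list(candidates)
    free = memory
    while remaining:
        best, best_ratio = None, 0.0
        for r in remaining:
            space = scale * groups[r]
            if space > free:
                continue  # not enough memory left for this phantom
            ratio = benefit(chosen, r) / space
            if ratio > best_ratio:
                best, best_ratio = r, ratio
        if best is None:
            break  # no candidate with positive benefit fits: stop
        chosen.append(best)
        remaining.remove(best)
        free -= scale * groups[best]
    return chosen


# Toy inputs (hypothetical): fixed benefits instead of a real cost model.
groups = {"ABCD": 2837, "ABC": 2117, "BCD": 2249}
toy_benefit = {"ABCD": 2.0, "ABC": 1.0, "BCD": -1.0}
chosen = greedy_by_space(list(groups), groups, scale=1.0, memory=12000,
                         benefit=lambda config, r: toy_benefit[r])
```

In a real setting the benefit of a candidate depends on the phantoms already chosen, which is why the callable receives the current configuration.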

Algorithmic Strategies for Choosing the Phantoms: Greedy by Increasing Space (example)
[Figure: lattice of Q1 = AB, Q2 = BC, Q3 = BD, Q4 = CD, then ABC, ABD, BCD, then ABCD, annotated with the group counts g = 2837, 2117, 1846, 2387, 2249, 1946, 1899, 1999 and with Benefit = 2, Benefit = 1, Benefit = -1. "Try ABCD": the available-memory arithmetic is not preserved apart from the result 1473.]
Benefit/Space is the metric, with $E_1 - E_2 = (2 - 3x_2)c_1 + 3(x_1 - x_1'x_2)c_2$.
The process ends when:
1. The benefit becomes negative.
2. The space is exhausted.

Drawback
The space-allocation constant of Greedy by Increasing Space needs to be tuned to find the best performance.

Space Allocation
- According to Abel's impossibility theorem, general polynomial equations of degree higher than 4 cannot be solved algebraically; we call these cases unsolvable.
- More general multi-level configurations generate equations of even higher order, which are unsolvable.
- We use heuristics, based on the analysis available, to decide the space allocation for these unsolvable cases.

Space Allocation
- Super-node with Linear Combination
- Super-node with Square Root Combination
- Linear Proportional Allocation
- Square Root Proportional Allocation

Conclusion
- We address the problem of efficiently computing multiple aggregations over high-speed data streams.
- In a real DSMS, the value of g is unknown.