Multiple Aggregations Over Data Streams
Rui Zhang (National Univ. of Singapore), Nick Koudas (Univ. of Toronto), Beng Chin Ooi (National Univ. of Singapore), Divesh Srivastava (AT&T Labs-Research)
SIGMOD 2005

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Aggregate Query Over Streams

Select tb, SrcIP, count(*) from IPPackets group by time/60 as tb, SrcIP

Packet schema: (SrcIP, SrcPort, DstIP, DstPort, time, …)

More examples:
–Gigascope: A Stream Database for Network Applications (SIGMOD '03).
–Holistic UDAFs at Streaming Speed (SIGMOD '04).
–Sampling Algorithms in a Stream Operator (SIGMOD '05).
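The query above buckets packets into 60-second windows and counts per SrcIP. A minimal sketch of this tumbling-window aggregation (the field layout and sample data are illustrative, not from the paper):

```python
from collections import defaultdict

def aggregate(packets, window=60):
    """Count records per (time bucket, SrcIP), as in:
    select tb, SrcIP, count(*) from IPPackets group by time/60 as tb, SrcIP"""
    counts = defaultdict(int)
    for src_ip, t in packets:          # each packet: (SrcIP, timestamp)
        tb = t // window               # tumbling 60-second bucket
        counts[(tb, src_ip)] += 1
    return dict(counts)

packets = [("10.0.0.2", 5), ("10.0.0.2", 30), ("10.0.0.7", 45), ("10.0.0.2", 70)]
print(aggregate(packets))
# {(0, '10.0.0.2'): 2, (0, '10.0.0.7'): 1, (1, '10.0.0.2'): 1}
```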

Gigascope

All inputs and outputs are streams. Two-level structure: LFTA and HFTA.
–LFTA / HFTA: Low-/High-level Filter, Transform and Aggregation.
Simple operations run in the LFTA to:
–reduce the amount of data sent to the HFTA.
–fit into the L3 cache.

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Single Aggregation

Select tb, SrcIP, count(*) from IPPackets group by time/60 as tb, SrcIP

Example: incoming SrcIP values 2, 24, 2, 17, 12, …; hashed into the LFTA table by modulo 10.

Costs:
–Probe cost: C1 for probing the hash table in the LFTA.
–Eviction cost: C2 for updating the HFTA from the LFTA when a colliding group displaces the resident one.
–The bottleneck is the total of the C1 and C2 costs.
–Everything is evicted at the end of each time bucket.
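A minimal sketch of this LFTA scheme, assuming a direct-mapped table where a colliding group evicts the resident partial count up to the HFTA (table size and input follow the slide's example; the class name is ours):

```python
class LFTA:
    """Fixed-size hash table; on a collision the resident group's
    partial count is flushed to the HFTA and replaced."""
    def __init__(self, buckets=10):
        self.table = [None] * buckets   # each slot: [key, count] or None
        self.probes = 0                 # number of C1 (probe) operations
        self.evictions = 0              # number of C2 (eviction) operations

    def update(self, key):
        self.probes += 1
        slot = key % len(self.table)    # hash by modulo, as in the example
        entry = self.table[slot]
        if entry is not None and entry[0] != key:
            self.evictions += 1         # flush the resident group's count
            entry = None
        if entry is None:
            self.table[slot] = [key, 1]
        else:
            entry[1] += 1

lfta = LFTA()
for src_ip in [2, 24, 2, 17, 12]:
    lfta.update(src_ip)
print(lfta.probes, lfta.evictions)      # 5 1 (12 collides with 2 in slot 2)
```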

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Multiple Aggregations

Relation R contains attributes A, B, C. Three queries:
–Select tb, A, count(*) from R group by time/60 as tb, A
–Select tb, B, count(*) from R group by time/60 as tb, B
–Select tb, C, count(*) from R group by time/60 as tb, C

Cost: E1 = 3nc1 + 3x1nc2
–n: number of incoming records.
–x1: collision rate of the hash tables on A, B, C.

Alternatively …

Maintain a phantom ABC that feeds the three query tables A, B, C, keeping the total size the same.

Cost: E2 = nc1 + 3x2nc1 + 3x1'x2nc2
–x2: collision rate of the phantom's hash table on ABC.
–x1': collision rate of the hash tables on A, B, C.

Cost Comparison

Without phantom: E1 = 3nc1 + 3x1nc2
With phantom: E2 = nc1 + 3x2nc1 + 3x1'x2nc2
Difference: E1 - E2 = [(2 - 3x2)c1 + 3(x1 - x1'x2)c2]n
If x2 is small, then E1 - E2 > 0, so the phantom pays off.
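To make the trade-off concrete, the two cost formulas can be evaluated numerically (the unit costs and collision rates below are illustrative, not from the paper's experiments):

```python
def e1(n, c1, c2, x1):
    # no phantom: 3 probes per record, evictions at rate x1 per table
    return 3 * n * c1 + 3 * x1 * n * c2

def e2(n, c1, c2, x1p, x2):
    # phantom: 1 probe into ABC, then A/B/C are fed only at rate x2,
    # and their evictions happen at rate x1' on that reduced stream
    return n * c1 + 3 * x2 * n * c1 + 3 * x1p * x2 * n * c2

n, c1, c2 = 1_000_000, 1, 10            # assumed record count and unit costs
x1, x1p, x2 = 0.3, 0.3, 0.1             # assumed collision rates
print(e1(n, c1, c2, x1) > e2(n, c1, c2, x1p, x2))   # low x2: phantom wins
```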

More Phantoms

Relation R contains attributes A, B, C, D. Queries: group by AB, BC, BD, CD.
The relation feeding graph shows which phantoms can feed which queries.

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Problem definition

Constraint: a fixed amount of memory M.
–Guarantee a low loss rate when evicting everything at the end of the time window.
–The size should be small enough to fit in the L3 cache.
–Hardware (network card) memory size limits.
Problems:
–1) Phantom choosing. A configuration is a set of queries and phantoms.
–2) Space allocation, where x ∝ g/b (collision rate grows with groups per bucket).
Objective: minimize the cost.

The View Materialization Problem

View lattice with sizes: psc 6M; ps 0.8M; pc 6M; sc 6M; p 0.2M; s 0.01M; c 0.1M; none 1.

Differences

View materialization problem:
–A materialized view uses a fixed amount of space.
–Materializing a view is always beneficial.
Multi-aggregation problem:
–A maintained phantom can use a flexible amount of space; the smaller the space, the higher the collision rate of its hash table.
–Maintaining a phantom is not always beneficial: high-collision-rate hash tables increase the overall cost.

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Algorithmic Strategies

Brute force: try all possibilities of phantom combinations and all possibilities of space allocation.
–Too expensive.
Greedy by increasing space used (hint: x ≈ g/b, see the analysis later).
–b = φg, with φ large enough to guarantee a low collision rate.
Greedy by increasing collision rate (our proposal).
–Models the collision rate accurately.
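A minimal sketch of the greedy-by-increasing-space strategy described above: each candidate phantom needs b = φg buckets to keep its collision rate low, and phantoms are added in order of required space while the memory budget M allows (the candidate group counts are hypothetical):

```python
def greedy_by_space(candidates, M, phi=4.0):
    """candidates: {phantom_name: number_of_groups g}.
    Each chosen phantom gets b = phi * g buckets, keeping its
    collision rate x ≈ g/b low; stop when memory M runs out."""
    chosen, used = [], 0.0
    for name, g in sorted(candidates.items(), key=lambda kv: kv[1]):
        b = phi * g                     # space this phantom would need
        if used + b <= M:
            chosen.append(name)
            used += b
    return chosen, used

candidates = {"ABC": 5000, "AB": 800, "BC": 1200}   # hypothetical group counts
print(greedy_by_space(candidates, M=10_000))        # (['AB', 'BC'], 8000.0)
```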

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Collision Rate Model

Random data distribution:
–n_rg: expected number of records in a group.
–k: number of groups hashing to a bucket.
–n_rg·k: number of records hashing to the bucket.
–Random hash: probability of collision 1 − 1/k.
–n_rg·k(1 − 1/k): expected number of collisions in the bucket.
–g: total number of groups; b: total number of buckets.
Clustered data distribution:
–l_a: average flow length.
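The model ties the collision rate to the load g/b. A small simulation under the random-distribution assumption (uniform random groups, modulo hashing, eviction on collision; all parameters illustrative) shows the rate falling as buckets are added:

```python
import random

def collision_rate(g, b, n=50_000, seed=0):
    """Simulated LFTA collision rate: fraction of updates landing on a
    bucket currently held by a different group (which gets evicted)."""
    rng = random.Random(seed)
    keys = [rng.getrandbits(32) for _ in range(g)]   # g distinct group keys
    table = [None] * b
    collisions = 0
    for _ in range(n):
        grp = rng.choice(keys)          # records uniform over the g groups
        slot = grp % b
        if table[slot] is not None and table[slot] != grp:
            collisions += 1
        table[slot] = grp
    return collisions / n

for b in (500, 1000, 2000, 8000):       # vary buckets for g = 1000 groups
    print(b, round(collision_rate(1000, b), 3))
```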

The Low Collision Rate Part

A phantom is beneficial only when the collision rate is low, so only the low-collision-rate part of the collision rate curve is of interest; it is approximated by linear regression.

Space Allocation: The Two-level Case

One phantom R0 feeds all queries R1, R2, …, Rf; their hash tables' collision rates are x0, x1, …, xf.
Setting the partial derivative of the cost e with respect to each bi to 0 yields a quadratic equation.

Space Allocation: General Cases

The general case yields equations of order higher than 4, which are unsolvable algebraically (Abel's Theorem). Partial results:
–b1² is proportional to …
Heuristics:
–Treat the configuration as two-level cases recursively, using supernodes.
Implementation:
–SL: supernode with linear combination of the number of groups.
–SR: supernode with square-root combination of the number of groups.
–PL: proportional linearly to the number of groups.
–PR: proportional to the square root of the number of groups.
–ES: exhaustive space allocation.
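The two proportional heuristics can be stated directly: PL splits the memory in proportion to each table's group count, PR in proportion to its square root. A sketch (the function name and group counts are ours, for illustration):

```python
import math

def allocate(group_counts, M, scheme="PL"):
    """Split M buckets among hash tables: PL weights by g, PR by sqrt(g)."""
    weight = (lambda g: g) if scheme == "PL" else (lambda g: math.sqrt(g))
    total = sum(weight(g) for g in group_counts.values())
    return {name: M * weight(g) / total for name, g in group_counts.items()}

tables = {"ABC": 9000, "AB": 1000}      # hypothetical group counts
print(allocate(tables, 10_000, "PL"))   # {'ABC': 9000.0, 'AB': 1000.0}
print(allocate(tables, 10_000, "PR"))   # PR gives the small table more space
```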

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Experiments: space allocation

Configurations (queries in red; phantoms in blue): (ABCD(ABC(A BC(B C)) D)) and (ABCD(AB BCD(BC BD CD))).
Comparison of space allocation schemes:
–x-axis: memory constraint; y-axis: relative error compared to the optimal space allocation.
Heuristics:
–SL: supernode with linear combination of the number of groups.
–SR: supernode with square-root combination of the number of groups.
–PL: proportional linearly to the number of groups.
–PR: proportional to the square root of the number of groups.
Result: SL is the best; SL and SR are generally better than PL and PR.

Experiments: phantom choosing

Heuristics:
–GCSL: greedy by increasing collision rate; space allocated using supernode with linear combination of the number of groups.
–GCPL: greedy by increasing collision rate; space allocated proportionally linearly to the number of groups.
–GS: greedy by increasing space.
Comparison of greedy strategies:
–x-axis: φ; y-axis: relative cost compared to the optimal cost.
Phantom choosing process:
–x-axis: number of phantoms chosen; y-axis: relative cost compared to the optimal cost.
Results: GCSL is better than GS; GCPL is the lower bound of GS.

Experiments: real data

Setup: actually let the data records stream through the hash tables and measure the cost.
–x-axis: memory constraint; y-axis: relative cost compared to the optimal cost.
Results (GCSL vs. GS; maintaining phantoms vs. no phantoms):
–GCSL is very close to optimal and always better than GS.
–By maintaining phantoms, we reduce the cost by up to a factor of 35.

Outline Introduction –Query example and Gigascope –Single aggregation –Multiple aggregations –Problem definition Algorithmic strategies Analysis Experiments Conclusion and future work

Conclusion and Future Work

We introduced the notion of phantoms (fine-granularity aggregation queries) that have the benefit of supporting shared computation. We formulated the multiple-aggregation (MA) problem, analyzed its components, and proposed greedy heuristics to solve it. Through experiments on both real and synthetic data sets, we demonstrated the effectiveness of our techniques: the cost achieved by our solution is up to 35 times less than that of the existing solution. We are working to deploy this framework in a real DSMS.

Questions?