Approximation Algorithms for Frequency Related Query Processing on Streaming Data (Ph.D. Seminar, University of Alberta, presented by Fan Deng)

Presentation transcript:

Ph.D. Seminar, University of Alberta
Approximation Algorithms for Frequency Related Query Processing on Streaming Data
Presented by Fan Deng
Supervisor: Dr. Davood Rafiei
May 24, 2007

Data stream
A sequence of data records
Examples
– Document/URL streams from a Web crawler
– IP packet streams
– Web advertisement click streams
– Sensor reading streams
– ...

Processing in one pass
One pass processing
– Online stream (one scan required)
– Massive offline stream (one scan preferred)
Challenges
– Huge data volume
– Fast processing requirement
– Relatively small fast storage space

Approximation algorithms
Exact query answers
– can be slow to obtain
– may need large storage space
– are sometimes not necessary
Approximate query answers
– can take much less time
– may need less space
– come with acceptable errors

Frequency related queries
Frequency
– # of occurrences
Continuous membership query
Point query
Similarity self-join size estimation

Outline
Introduction
Continuous membership query
– Motivating application
– Problem statement
– Existing solutions and our solution
– Theoretical and experimental results
Point query
Similarity self-join size estimation
Conclusions and future work

A Motivating Application
Duplicate URL detection in Web crawling
Search engines [Broder et al. WWW03]
– Fetch web pages continuously
– Extract URLs within each downloaded page
– Check each URL (duplicate detection): if never seen before, fetch it; otherwise skip it

A Motivating Application (cont.)
Problems
– Huge number of distinct URLs
– Memory is usually not large enough
– Disks are slow
Errors are usually acceptable
– False positives (false alarms): a distinct URL is wrongly reported as a duplicate. Consequence: this URL will not be crawled
– False negatives (misses): a duplicate URL is wrongly reported as distinct. Consequence: this URL will be crawled redundantly or searched for on disk

Problem statement
A sequence of elements with order: … d g a f b e a d c b a
Storage space M
– Not large enough to store all distinct elements
Continuous membership query
– Appeared before? Yes or No
Our goal
– Minimize the # of errors
– Fast

An existing solution (caching)
Store as many distinct elements as possible in a buffer
Duplicate detection process
– Upon element arrival, search the buffer
– If found, report “duplicate”; otherwise report “distinct”
Update the buffer using a replacement policy
– LRU, FIFO, Random, …
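The caching approach can be sketched in a few lines. This is only an illustration, assuming LRU as the replacement policy; the class name `LRUDuplicateDetector` and the method `seen_before` are hypothetical and not taken from the talk.

```python
from collections import OrderedDict


class LRUDuplicateDetector:
    """Buffer-based duplicate detection with LRU replacement (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()  # keys are elements, ordered oldest to newest

    def seen_before(self, element):
        if element in self.buffer:
            # Found in the buffer: refresh its recency and report "duplicate".
            self.buffer.move_to_end(element)
            return True
        # Not found: report "distinct" and remember the element.
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)  # evict the least recently used element
        return False


def lru_demo():
    detector = LRUDuplicateDetector(capacity=2)
    return [detector.seen_before(x) for x in "abacb"]
```

In `lru_demo`, the final `b` is reported as distinct because it was evicted earlier, which is the false negative case the buffer cannot avoid once it is full.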

Another solution (Bloom filters)
A bitmap, originally all “0”
Duplicate detection process
– Hash each incoming element into some bits
– If any bit is “0”, report “distinct”; otherwise report “duplicate”
Update process
– Set the corresponding bits to “1”
Example hash values
x    h1(x)    h2(x)
a    1        2
b    1        3
c    2        4
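A minimal Bloom filter for duplicate detection might look like the following. The hash functions are an assumption: the slide does not say how h1 and h2 are built, so this sketch derives them from SHA-256 with a per-function prefix. The names `BloomFilter` and `seen_before` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bitmap-based duplicate detection (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, element):
        # Assumed hash family: SHA-256 of "<i>:<element>" reduced modulo the bitmap size.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def seen_before(self, element):
        positions = list(self._positions(element))
        # Any "0" bit means the element is certainly new.
        duplicate = all(self.bits[p] for p in positions)
        for p in positions:
            self.bits[p] = 1  # update step: set the corresponding bits
        return duplicate
```

Because bits are never cleared, a repeated element is always reported as a duplicate, while a new element can be misreported once enough bits are set.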

Another solution (Bloom filters, cont.)
False positives (false alarms)
Bloom filters will become “full”
– All distinct URLs will be reported as duplicates, and thus skipped!

Our solution (Stable Bloom Filters)
Kick “elements” out of the Bloom filters
Change bits to “cells” (“cellmap”)

Stable Bloom Filters (SBF, cont.)
A “cellmap”, originally all “0”
Duplicate detection
– Hash each element into some cells, check those cells
– If any cell is “0”, report “distinct”; otherwise report “duplicate”
Kick “elements”
– Randomly choose some cells and decrement them by 1
Update the “cellmap”
– Set the hashed cells to a predefined value, Max > 0
– Use the same hash functions as in the detection stage
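The three SBF steps (detect, kick, update) can be sketched as below. Several details are assumptions made for illustration: cells to decrement are drawn independently and uniformly at random, the order is detect, then kick, then update, and the hash functions are derived from SHA-256. The names `StableBloomFilter`, `kick_count` and `seen_before` are hypothetical.

```python
import hashlib
import random


class StableBloomFilter:
    """Cellmap with random decrements (illustrative sketch)."""

    def __init__(self, num_cells, num_hashes, max_value, kick_count, seed=0):
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.max_value = max_value    # "Max" on the slide
        self.kick_count = kick_count  # cells decremented per arriving element
        self.cells = [0] * num_cells
        self.rng = random.Random(seed)

    def _positions(self, element):
        # Assumed hash family, shared by the detection and update steps.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_cells

    def seen_before(self, element):
        positions = list(self._positions(element))
        # Detection: any zero cell means "distinct".
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Kick: decrement randomly chosen cells, never going below zero.
        for _ in range(self.kick_count):
            j = self.rng.randrange(self.num_cells)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Update: set the hashed cells to Max.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```

Unlike a plain Bloom filter, old elements fade out as their cells are decremented, so the filter can report false negatives as well as false positives.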

SBF theoretical results
SBF will be stable
– The expected # of “0”s will become a constant after a number of updates
– Converges at an exponential rate
– Monotonic
False positive rates become constant
An upper bound of false positive rates
– A function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate
Setting the optimal parameters (partially empirical)

SBF experimental results
Experimental comparison between SBF and the Caching/Buffering method (LRU)
– URL fingerprint data set, originally obtained from the Internet Archive (~ 700M URLs)
To compare fairly, we introduce FPBuffering
– Let Caching generate some false positives
FPBuffering
– If an element is not found in the buffer, report “duplicate” with a certain probability
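FPBuffering as described above is an LRU buffer that occasionally answers “duplicate” on a miss. The sketch below assumes that the buffer is still updated on such a miss, which the slide does not specify; the names `FPBuffering` and `fp_probability` are hypothetical.

```python
import random
from collections import OrderedDict


class FPBuffering:
    """LRU buffer that reports some misses as duplicates (illustrative sketch)."""

    def __init__(self, capacity, fp_probability, seed=0):
        self.capacity = capacity
        self.fp_probability = fp_probability
        self.buffer = OrderedDict()
        self.rng = random.Random(seed)

    def seen_before(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)
            return True
        # Miss: remember the element (assumed behaviour), evicting the LRU entry if needed.
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)
        # Report "duplicate" anyway with the configured probability.
        return self.rng.random() < self.fp_probability
```

Setting `fp_probability` to 0 recovers plain LRU buffering, so the parameter can be tuned until the false positive count matches that of the SBF being compared.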

SBF experimental results (cont.)
SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)

SBF experimental results (cont.)

SBF experimental results (cont.)
MIN [Broder et al. WWW03], theoretically optimal
– assumes “the entire sequence of requests is known in advance”
– beats LRU caching by <5% in most cases
The more false positives are allowed, the more SBF gains

Outline
Introduction
Continuous membership query
Point query
– Motivating application
– Problem statement
– Existing solutions and our solution
– Theoretical and experimental results
Similarity self-join size estimation
Conclusions and future work

Motivating application
Internet traffic monitoring
– Query the # of IP packets sent by a particular IP address in the past hour
Phone call record analysis
– Query the # of calls to a given phone # yesterday

Problem statement
Point query
– Summarize a stream of elements
– Estimate the frequency of a given element
Goal: minimize the space cost and answer the query fast

Existing solutions
Fast-AGMS sketch [AMS97, Charikar et al. 2002]
Count-min sketch (counting Bloom filters)
– e.g. an element is hashed to 4 counters
– Take the min counter value as the estimate
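As a rough illustration of the Count-min idea, the sketch below hashes each element to one counter in each of 4 rows and returns the minimum. The SHA-256 based hashing and the names `CountMinSketch`, `depth` and `width` are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Point-query summary that takes the minimum of the hashed counters (sketch)."""

    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of counters each element is hashed to
        self.width = width  # counters per row
        self.counters = [[0] * width for _ in range(depth)]

    def _index(self, row, element):
        # Assumed per-row hash function.
        digest = hashlib.sha256(f"{row}:{element}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, element, count=1):
        for row in range(self.depth):
            self.counters[row][self._index(row, element)] += count

    def estimate(self, element):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(
            self.counters[row][self._index(row, element)]
            for row in range(self.depth)
        )
```

With non-negative updates the estimate never falls below the true frequency; it can only overestimate.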

Our solution
Count-median-mean (CMM)
– Count-min based
– Take the value of the counter the element is hashed to
– Deduct the median/mean value of all other counters
– When the mean is deducted, the remainder is an unbiased estimate
– Basic idea: all counters are expected to have the same value
Example
– counter value = 3
– mean value of all other counters = 2 (median = 2, more robust)
– remainder = 1, so frequency estimate = 3 - 2 = 1
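The per-counter correction can be written out directly. This sketch works on plain lists of counters; combining the per-row remainders by taking their median is an assumption suggested by the name Count-median-mean, and the function names `row_estimate` and `cmm_estimate` are hypothetical.

```python
from statistics import mean, median


def row_estimate(row, index, noise=mean):
    """Counter value minus the mean (or median) of all other counters in the row."""
    others = row[:index] + row[index + 1:]
    return row[index] - noise(others)


def cmm_estimate(rows, indexes, noise=mean):
    """Combine per-row remainders by taking their median (assumed combination rule)."""
    return median(row_estimate(r, i, noise) for r, i in zip(rows, indexes))
```

For the slide's example, `row_estimate([3, 2, 2, 2], 0)` deducts the mean 2 from the counter value 3 and leaves 1. Passing `noise=median` switches to the more robust variant.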

Theoretical results
Unbiased estimate (when deducting the mean)
Estimate variance is the same as that of Fast-AGMS (when deducting the mean)
For less skewed data sets
– the estimation accuracies of CMM and Fast-AGMS are exactly the same

Experimental results and analysis
For skewed data sets
– Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean
Time cost analysis
– CMM-mean = Fast-AGMS < CMM-median
– but the difference is small
Advantage of CMM
– More flexible (with estimate upper bound)
– More powerful (Count-min can be more accurate for very skewed data sets)

Outline
Introduction
Continuous membership query
Point query
Similarity self-join size estimation
– Motivating application
– Problem statement
– Existing solutions and our solution
– Theoretical and experimental results
Conclusions and future work

Motivating application
Near-duplicate document detection for search engines [Broder 99, Henzinger 06]
– Very slow (30M pages, 10 days in 1997; 2006?)
– Good to predict the time
– How? Estimate the number of similar pairs
Data cleaning in general (similarity self-join)
– To find a better query plan (query optimization)
– Estimates of the similarity self-join size are needed

Problem statement
Similarity self-join size
– Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
An s-similar pair
– A pair of records with s attributes in common
– E.g. two records that share 3 attribute values are 3-similar

Existing solutions
A straightforward solution
– Compare each record with all other records
– Count the number of pairs at least s-similar
– Time cost O(n^2) for n records
Random sampling
– Take a sample of size m uniformly at random
– Count the number of pairs at least s-similar
– Scale it by a factor of c = n(n-1) / (m(m-1))
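Both baselines fit in a short sketch. It assumes records are equal-length tuples and that two records have an attribute "in common" when they hold the same value at the same position; the function names are hypothetical.

```python
import random
from itertools import combinations


def similarity(a, b):
    """Number of positions at which two records hold the same value (assumed definition)."""
    return sum(x == y for x, y in zip(a, b))


def exact_size(records, s):
    """Straightforward solution: compare every pair, O(n^2) comparisons."""
    return sum(1 for a, b in combinations(records, 2) if similarity(a, b) >= s)


def sampled_size(records, s, m, seed=0):
    """Random sampling: count similar pairs in a sample of size m, then scale up."""
    n = len(records)
    sample = random.Random(seed).sample(records, m)
    scale = n * (n - 1) / (m * (m - 1))  # c = n(n-1) / (m(m-1))
    return exact_size(sample, s) * scale
```

When m equals n the scale factor is 1 and the sampled estimate coincides with the exact count; for smaller m the estimate varies with the sample drawn.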

Our solution
Offline SimParCount (Step 1 - data processing)
– Linearly scan all records once
– For each record, for each k = s…d: randomly pick k different attribute values and concatenate them into one k-super-value; repeat this process l_k times
– Look at all k-super-values as a stream
– Store the (d-s+1) super-value streams on disk
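Step 1 can be sketched as a single pass that emits one stream per k. Two details are assumptions made for illustration: attributes are picked by position with `random.sample`, and a super-value is represented as a tuple of (position, value) pairs rather than a concatenated string. The function name `super_value_streams` and the `repeats` mapping (standing in for l_k) are hypothetical, and streams are kept in memory here instead of on disk.

```python
import random


def super_value_streams(records, s, repeats, seed=0):
    """Build the d-s+1 super-value streams in one scan over the records (sketch)."""
    rng = random.Random(seed)
    d = len(records[0])
    streams = {k: [] for k in range(s, d + 1)}
    for record in records:
        for k in range(s, d + 1):
            for _ in range(repeats[k]):  # l_k repetitions for this k
                positions = sorted(rng.sample(range(d), k))
                # One k-super-value: the chosen attribute values with their positions.
                streams[k].append(tuple((p, record[p]) for p in positions))
    return streams
```

For k = d there is only one possible choice of positions, so every repetition yields the full record.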

Our solution (cont.)
Offline SimParCount (Step 2 - result generating)
– Obtain the self-join size of those 1-dimensional super-value streams
– Based on the d-s+1 self-join sizes, estimate the similarity self-join size
Online SimParCount
– Use small sketches to estimate stream self-join sizes rather than expensive external sorting
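For reference, the self-join size of a one-dimensional stream is the sum of the squared frequencies of its distinct values. The exact, in-memory computation below is only a sketch of that quantity; it is not the external-sorting or sketch-based procedure the talk refers to, and the function name is hypothetical.

```python
from collections import Counter


def stream_self_join_size(stream):
    """Exact self-join size: sum over distinct values of frequency squared."""
    return sum(count * count for count in Counter(stream).values())
```

For the stream a, a, b the frequencies are 2 and 1, so the self-join size is 4 + 1 = 5.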

Our solution (cont.)
Key idea
– Convert similarity self-join size estimation to stream self-join size estimation
– A similar record pair will have a certain chance of producing a match in the super-value stream

Theoretical results
Unbiased estimate
Standard deviation bound of the estimate
Time and space cost
(For both offline and online SimParCount)

Experimental results
Online SimParCount vs. random sampling
– Given the same amount of space
– Error = (estimate - trueValue) / trueValue
– Dataset: DBLP paper titles, each converted into a record with 6 attributes using min-wise independent hashing

Similarity self-join size estimation: experimental results (cont.)

Conclusions and future work
Streaming algorithms
– found real applications (important)
– can lead to theoretical results (fun)
– More work to be done
Current direction: multi-dimensional streaming algorithms
– E.g. estimating the # of outliers in one pass

Questions/Comments?
Thanks!