Duplicate Detection in Click Streams (2005)
Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi, Tian Wang



Goal of the paper: detect duplicate clicks. Approach: sliding windows, landmark windows, and jumping windows, combined with Bloom filter algorithms.

Sliding windows: a query covers the latest N clicks in the stream. The window is maintained as a queue structure.

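Landmark windows: handle disjoint portions of the data stream. Landmarks are defined by time or by number of clicks. Trade-off: space saving vs. duplicates that cross window boundaries.
[Figure: timeline with landmarks at 8:00, 9:00, 10:00, 11:00, 12:00, 13:00; clicks per window: 1,2,3,4 | 5,6,7,8 | 9,10,11 | 11,12,13 | 13,14,15 | 16,17,18; label: 50 clicks]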
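
A minimal sketch of a count-based landmark window, assuming an exact set is kept per window (the class and parameter names are hypothetical, not from the slides). It shows why a click repeated across a landmark is not flagged: the state is discarded at every landmark.

```python
class LandmarkWindowDetector:
    """Exact duplicate detection within disjoint, count-based landmark windows."""

    def __init__(self, clicks_per_window):
        self.clicks_per_window = clicks_per_window  # assumed landmark: every k clicks
        self.seen = set()
        self.count = 0  # clicks observed in the current window

    def observe(self, click_id):
        """Return True if click_id already occurred in the current window."""
        if self.count == self.clicks_per_window:
            # A landmark was reached: forget everything from the old window.
            self.seen.clear()
            self.count = 0
        self.count += 1
        duplicate = click_id in self.seen
        self.seen.add(click_id)
        return duplicate


detector = LandmarkWindowDetector(clicks_per_window=3)
flags = [detector.observe(c) for c in [9, 10, 11, 11, 12, 13]]
# The second 11 falls into a new window, so it is not reported as a duplicate.
```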

Jumping windows: a compromise between landmark windows and sliding windows. Maintain n sub-windows. Populate the latest sub-window and, when it is full, delete the oldest sub-window.
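
The sub-window mechanism can be sketched as follows. This is an illustration only: it assumes count-based sub-windows and stores exact sets, whereas the slides go on to use Bloom filters. All names are hypothetical.

```python
from collections import deque


class JumpingWindowDetector:
    """Exact duplicate detection over a jumping window made of sub-windows."""

    def __init__(self, num_subwindows, subwindow_size):
        self.num_subwindows = num_subwindows
        self.subwindow_size = subwindow_size
        self.subwindows = deque([set()])  # oldest on the left, newest on the right
        self.filled = 0  # clicks observed in the newest sub-window

    def observe(self, click_id):
        """Return True if click_id occurs in any sub-window still retained."""
        if self.filled == self.subwindow_size:
            # The newest sub-window is full: open a fresh one and,
            # if there are now too many, drop the oldest sub-window.
            self.subwindows.append(set())
            self.filled = 0
            if len(self.subwindows) > self.num_subwindows:
                self.subwindows.popleft()
        self.filled += 1
        duplicate = any(click_id in sw for sw in self.subwindows)
        self.subwindows[-1].add(click_id)
        return duplicate
```

Unlike a landmark window, a duplicate that crosses a sub-window boundary is still caught as long as the older sub-window has not been dropped.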

Algorithm for checking duplicate clicks
Basic approach: scan the entire window and compare. A window of size N requires O(N^2) comparisons.
Basic approach with an index: reduces the search cost to O(log N) per element, but increases the element insertion cost to O(log N).
Goals of the new algorithm: detect all duplicates, use less space, process quickly, and make few false-positive errors.

Bit vector algorithm: assume each unique click is identified by 32 bits. Keep a bit vector of length 2^32 bits, one bit for each identifier from 0...0 (32 bits) to 1...1 (32 bits). It takes O(1) steps and space to insert a new element into the bit vector, or to check it for duplication. Disadvantage: the vector becomes impractically large for longer click identifiers, e.g. 512 bits.
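
A minimal sketch of the bit vector idea, assuming click identifiers are non-negative integers. The default of 16 identifier bits is chosen only to keep the demo small, and the helper `vector_bytes` is hypothetical; it shows how the memory grows with identifier length.

```python
def vector_bytes(id_bits):
    """Bytes needed for one bit per possible identifier of id_bits bits."""
    return max(1, (1 << id_bits) // 8)


class BitVectorDetector:
    """Exact duplicate detection with one bit per possible click identifier."""

    def __init__(self, id_bits=16):
        self.id_bits = id_bits
        self.bits = bytearray(vector_bytes(id_bits))

    def observe(self, click_id):
        """Return True if click_id was seen before; record it either way."""
        if not 0 <= click_id < (1 << self.id_bits):
            raise ValueError("click_id does not fit in id_bits bits")
        byte_index, mask = click_id >> 3, 1 << (click_id & 7)
        duplicate = bool(self.bits[byte_index] & mask)
        self.bits[byte_index] |= mask  # O(1) insert
        return duplicate
```

For 32-bit identifiers `vector_bytes(32)` is 512 MiB, while `vector_bytes(512)` is 2^509 bytes, which is why the plain bit vector does not extend to long identifiers.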

Bit vector algorithm (modification): keep only partial information, p bits at a time, where 1 ≤ p ≤ b and b is the length in bits of a click identifier. The nth bit vector has one bit for every combination of identifier bits (n−1) ... (n+p−2). There are b−p+1 bit vectors in total. Bits used: b−p+1 vectors of 2^p bits each, i.e. (b−p+1) · 2^p.
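
The modified scheme can be sketched as below. Two points are assumptions for illustration: bit 0 is taken to be the least significant bit of the identifier, and each vector stores one byte per bit for readability. An element is reported as a duplicate only if its p-bit slice is already set in every vector, so false positives are possible.

```python
class PartialBitVectorDetector:
    """Approximate duplicate detection with b - p + 1 vectors of 2**p bits."""

    def __init__(self, b, p):
        if not 1 <= p <= b:
            raise ValueError("require 1 <= p <= b")
        self.b, self.p = b, p
        self.mask = (1 << p) - 1
        # Vector k (0-based) is indexed by identifier bits k .. k+p-1,
        # which is vector n = k + 1 in the slide's 1-based numbering.
        self.vectors = [bytearray(1 << p) for _ in range(b - p + 1)]

    def bits_used(self):
        """Logical size in bits: (b - p + 1) * 2**p."""
        return (self.b - self.p + 1) * (1 << self.p)

    def observe(self, click_id):
        """Return True if every vector already has this element's slice set."""
        duplicate = True
        for k, vector in enumerate(self.vectors):
            slot = (click_id >> k) & self.mask
            if not vector[slot]:
                duplicate = False
                vector[slot] = 1
        return duplicate
```

With p = b there is a single vector and the scheme reduces to the exact bit vector; smaller p saves space at the cost of false positives.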

Potential problems
Probabilistic analysis: any two elements picked at random have a probability of 2^−p of colliding in any one bit vector, since each vector covers p bits. However, if two elements a and b have the same values in bits (n−1) to (n+p−2), then there is a probability of 1/2 that the values of the bits n to (n+p−1) will also be equal, so collisions in adjacent bit vectors are not independent.
Goal: achieve better results and facilitate the probabilistic analysis.

Bloom filter
General idea: given two sets X and Y, check whether each element of X belongs to Y, using O(|X|) operations and O(|Y|) space.
Data structure: an array of M cells. Each element y ∈ Y is hashed by d independent hash functions to addresses y1, y2, ..., yd, with 0 ≤ yi ≤ M − 1 for all i, and those cells are set to 1. Each x ∈ X is hashed in the same manner. If all d of its cells are set to 1, there is a good probability that x ∈ Y.

Bloom filter for duplicate detection: test every new element against the Bloom filter built from the previously observed elements, and then insert it into the Bloom filter. The element is not counted as a duplicate if at least 1 bit was switched from 0 to 1, and is considered a duplicate otherwise.
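
The test-and-insert rule can be sketched as follows. The hash construction is an assumption made for the example: the d hash functions are simulated with `hashlib.blake2b` under d different salts, and elements are converted to strings before hashing. The class name is hypothetical.

```python
import hashlib


class BloomDuplicateDetector:
    """Approximate duplicate detection with a Bloom filter of m cells."""

    def __init__(self, m, d):
        self.m, self.d = m, d
        self.cells = bytearray(m)  # one byte per cell, for readability

    def _addresses(self, element):
        """Yield d addresses in [0, m - 1] for the element."""
        data = str(element).encode("utf-8")
        for i in range(self.d):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(2, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def observe(self, element):
        """Insert the element; return True if no cell changed from 0 to 1."""
        switched = False
        for address in self._addresses(element):
            if not self.cells[address]:
                self.cells[address] = 1
                switched = True
        return not switched
```

A true duplicate is always reported, because all of its cells were set on the first insertion. A new element is wrongly reported only when all d of its cells were already set by other elements.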

Bloom filter for sliding windows: use counting Bloom filters so that expired elements can be deleted. Each cell holds an integer that represents the number of elements that hash to this cell.
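
A minimal sketch of a counting Bloom filter over a sliding window. It assumes the latest N clicks are kept in a queue so that an expired element can be rehashed and its counters decremented; the salted `hashlib.blake2b` hashing and all names are likewise assumptions for illustration.

```python
import hashlib
from collections import deque


class SlidingWindowCountingBloom:
    """Approximate duplicate detection over the latest window_size clicks."""

    def __init__(self, window_size, m, d):
        self.window_size = window_size
        self.m, self.d = m, d
        self.counters = [0] * m  # elements currently hashing to each cell
        self.window = deque()    # queue of clicks inside the window

    def _addresses(self, element):
        data = str(element).encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(
                    data, digest_size=8, salt=i.to_bytes(2, "little")
                ).digest(),
                "little",
            ) % self.m
            for i in range(self.d)
        ]

    def observe(self, element):
        """Return True if all of the element's counters are already positive."""
        if len(self.window) == self.window_size:
            # Make room for the incoming click: expire the oldest one.
            expired = self.window.popleft()
            for address in self._addresses(expired):
                self.counters[address] -= 1
        addresses = self._addresses(element)
        duplicate = all(self.counters[a] > 0 for a in addresses)
        for address in addresses:
            self.counters[address] += 1
        self.window.append(element)
        return duplicate
```

Every click, duplicate or not, is inserted and later expired exactly once, so the counters always sum to d times the number of clicks in the window.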

Experiments: landmark and jumping windows. Trade-off between space, time, and error rate. Effect of the number of hash functions used.
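
For reference, the standard Bloom filter analysis (a textbook result, not taken from these slides) makes this trade-off concrete. Assuming n distinct elements have been inserted into M cells using d independent, uniform hash functions, where n is a symbol introduced here, the false-positive probability and the number of hash functions that minimizes it are approximately:

```latex
P_{\text{fp}} \approx \left(1 - e^{-dn/M}\right)^{d},
\qquad
d_{\text{opt}} = \frac{M}{n}\,\ln 2 .
```

More cells lower the error rate at the cost of space. More hash functions cost time per element and reduce the error rate only up to d_opt.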