Oops, I made it again… Why getting it almost right is OK, and why scrambling the data may help


With respect to the internet, answering "Which of these is the most popular web search?" is a much, much easier question than answering "What is the most popular web search?"

Let's assume Google received its search requests as one long data stream that it could read in real time… The straightforward solution would be to append each new word to an array containing all words already encountered and update a corresponding counter:

…, "Yo dog", "Girls gone wild", "Dog ate chocolate", …
{yo=1, dog=2, girls=1, gone=1, wild=1, ate=1, chocolate=1}
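A minimal sketch of that straightforward approach in Python (the `count_words` name and the toy query list are invented for illustration, not from the source):

```python
from collections import Counter

def count_words(stream):
    """Naive approach: keep one counter for every word ever seen.

    `stream` is assumed to be an iterable of query strings --
    a stand-in for the real request feed.
    """
    counts = Counter()
    for query in stream:
        for word in query.lower().split():
            counts[word] += 1  # memory grows with every new distinct word
    return counts

queries = ["Yo dog", "Girls gone wild", "Dog ate chocolate"]
print(count_words(queries))
# Counter({'dog': 2, 'yo': 1, 'girls': 1, 'gone': 1, 'wild': 1,
#          'ate': 1, 'chocolate': 1})
```

Every distinct word claims a slot forever, which is exactly the problem the next slides go after.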

Deciding whether to append the new word or increment an existing counter might require an expensive search through the array. But more importantly, the size of the array would be astronomical, with no cap on memory.

Imagine if the English language were dumbed down to a few words, or better yet, the integers 1 to 9. Also assume that one number (let's say 4) has the majority of the instances, meaning that more than 50% of the numbers in the stream are actually 4. With the "majority rule" method we keep just two pieces of memory: 1) the current majority candidate (maj), and 2) a counter associated with that number (count).

The rule is that we increment count when we stream across the number stored in maj, decrement it otherwise, and when count hits zero we replace maj with the incoming number and reset count to 1. Example (using 7 as a stand-in for any number other than 4):

stream: 4          maj=4, count=1
stream: 4 4        maj=4, count=2
stream: 4 4 7      maj=4, count=1
stream: 4 4 7 7    maj=4, count=0
stream: 4 4 7 7 3  maj=3, count=1
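A sketch of this rule in Python (in the literature it is known as the Boyer-Moore majority vote algorithm; the function name is invented):

```python
def majority_candidate(stream):
    """One candidate (maj) and one counter (count), as in the slide.

    Guaranteed to return the majority element if one exists; otherwise
    the returned value is just whichever candidate happened to survive.
    """
    maj, count = None, 0
    for x in stream:
        if count == 0:        # counter exhausted: adopt the new number
            maj, count = x, 1
        elif x == maj:        # saw the stored number again: increment
            count += 1
        else:                 # saw anything else: decrement
            count -= 1
    return maj

print(majority_candidate([4, 4, 7, 7, 3]))  # 3 -- no majority, no guarantee
print(majority_candidate([4, 7, 4, 3, 4]))  # 4 -- a true majority (3 of 5)
```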

In this case, if 4 had actually been the majority, maj would equal 4 when the stream was complete. The method is guaranteed to find the majority element if one exists, but the number left in memory at completion is not guaranteed to account for >50% of the occurrences. Extend this to m maj variables and you can catch every element with frequency above n/(m+1). Example: use m=99 to find whether a word appears in more than 1% of web search queries. Actually pretty robust!
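A hedged sketch of the m-counter extension (this is essentially the Misra-Gries frequent-elements algorithm; the function name and demo stream are invented):

```python
def frequent_candidates(stream, m):
    """m candidate counters instead of one.

    Any element occurring more than n/(m+1) times in a stream of
    length n is guaranteed to end up among the candidates (alongside
    possible false positives).
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:      # a free slot: start tracking x
            counters[x] = 1
        else:                        # no slot: decrement every counter
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# m = 99 would surface every word above 1% of queries; m = 2 keeps
# the demo readable.
print(frequent_candidates([1, 2, 4, 4, 3, 4, 4], m=2))  # {4: 3, 3: 1}
```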

Going back to the original straightforward method of appending to a huge array… what if we just removed the most infrequent elements every once in a while? This gives very good results in practice, but we still have the unbounded-space problem. This (along with majority rule) illustrates that we will not get the exactly correct answer 100% of the time if we must obey a constant-space constraint. But is that really all that bad?
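One possible sketch of that pruning heuristic (the `prune_every` and `min_count` knobs are arbitrary placeholders, not from the source):

```python
from collections import Counter

def count_with_pruning(stream, prune_every=1000, min_count=2):
    """Count everything, but periodically throw away the rarest
    entries. Space still has no hard cap: between prunes the table
    can balloon, and popular-enough rare words can slip away."""
    counts = Counter()
    for i, word in enumerate(stream, 1):
        counts[word] += 1
        if i % prune_every == 0:
            counts = Counter({w: c for w, c in counts.items()
                              if c >= min_count})
    return counts
```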

A uniform random distribution has known, predictable statistical properties (much like the standard normal distribution). A method used in computer science called "hashing" essentially scrambles and bins values that come from an unpredictable distribution to make them appear uniformly distributed. The bins can then be analyzed statistically to make generalizations about the data stream.
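A toy sketch of the idea (not from the slides): treat each item's hash as a uniform random number in [0, 1). With n distinct items, the expected minimum hash is about 1/(n+1), so 1/min − 1 estimates n. A single minimum is very noisy; production sketches such as Flajolet-Martin or HyperLogLog combine many such observations:

```python
import hashlib

def estimate_distinct(stream):
    """Estimate the number of distinct items from the minimum hash.

    Hashing scrambles arbitrary strings into pseudo-uniform values,
    which is what licenses the statistical reasoning above."""
    min_h = 1.0
    for item in stream:
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # map into [0, 1)
        min_h = min(min_h, h)
    return 1.0 / min_h - 1.0

words = [f"word{i % 500}" for i in range(10_000)]  # 500 distinct values
print(round(estimate_distinct(words)))  # a rough, noisy estimate near 500
```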

You'll always be Number 1 in my book, even though the '90s miss you.

Reference: Hayes, Brian. "The Britney Spears Problem." American Scientist 96.4 (2008): 274.