Why getting it almost right is OK, and why scrambling the data may help. Oops, I did it again…
With respect to the internet, answering "Which of these is the most popular web search?" is a much, much easier question than answering "What is the most popular web search?"
Let's assume Google received its search requests as one long data stream that it could read in, in real time. The straightforward solution would be to append each new word to an array containing all words that have already been encountered, and to update a corresponding counter. For example, the stream …, "Yo dog", "Girls gone wild", "Dog ate chocolate", … would produce the counts {yo=1, dog=2, girls=1, gone=1, wild=1, ate=1, chocolate=1}, as sketched below.
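A minimal sketch of that straightforward method (a toy illustration with a made-up naive_count helper, not anything Google actually runs):

```python
def naive_count(stream):
    """Count every word in the stream with an append-or-increment array."""
    table = []  # list of [word, count] pairs; grows without bound
    for query in stream:
        for word in query.lower().split():
            for entry in table:            # linear search through the array
                if entry[0] == word:
                    entry[1] += 1          # word seen before: bump its counter
                    break
            else:
                table.append([word, 1])    # brand-new word: append it
    return table

print(naive_count(["Yo dog", "Girls gone wild", "Dog ate chocolate"]))
# [['yo', 1], ['dog', 2], ['girls', 1], ['gone', 1], ['wild', 1], ['ate', 1], ['chocolate', 1]]
```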
Deciding whether to append the new word or increment an existing counter might require an expensive search through the array. But more importantly, the size of the array would be astronomical, with no maximum cap on memory.
[Image: Google servers (trippy, right?). Image credit: Google.]
Imagine the English language were dumbed down to a few words, or better yet, to the integers 1 through 9. Also assume that one number (let's say 4) accounted for the majority of the instances, meaning that more than 50% of the numbers in the stream are actually 4. With the "majority rule" method we would keep only two pieces of memory:
1) the most common number up to that point (maj)
2) a counter that we associate with that number (count)
The rule is that we increment count when we stream across the number stored in maj, decrement it when we see anything else, and, whenever count drops to 0, replace maj with the next number that arrives. Example (non-4 values shown generically):

  … 4                  maj=4, count=1
  … 4 4                maj=4, count=2
  … (a non-4)          maj=4, count=1
  … (another non-4)    maj=4, count=0
  … 3                  maj=3, count=1
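This two-variable scheme is the classic majority-vote idea (often credited to Boyer and Moore). A minimal sketch, assuming the stream arrives as a plain Python iterable of numbers:

```python
def majority_rule(stream):
    """Track a candidate majority element using only two pieces of memory."""
    maj, count = None, 0
    for x in stream:
        if count == 0:          # counter exhausted: adopt the new value
            maj, count = x, 1
        elif x == maj:          # saw the stored value again: increment
            count += 1
        else:                   # saw anything else: decrement
            count -= 1
    return maj                  # the true majority, IF one exists

print(majority_rule([4, 4, 7, 4, 3, 4, 4]))  # -> 4 (4 really is a majority here)
print(majority_rule([4, 7, 3, 3]))           # -> 3 (no majority; result is not trustworthy)
```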
In the trace above, if 4 really had been the majority, maj would equal 4 when the stream was complete. The method is guaranteed to find the majority element if one exists, but the number stored in memory at algorithm completion is not guaranteed to be a number with >50% of the occurrences (if no majority exists, the surviving candidate can be anything). Extend this to m maj variables and you can find every element that appears more than n/(m+1) times in a stream of n items. Example: use m=99 to find whether a word appears in more than 1% of web search queries. Actually pretty robust!
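The m-counter generalization is usually presented as the Misra–Gries frequent-items algorithm; the sketch below is one common formulation (a second pass over the data would be needed to verify the surviving candidates' true frequencies):

```python
def frequent_items(stream, m):
    """Keep at most m (item, counter) pairs; any item occurring more than
    n/(m+1) times in a stream of length n is guaranteed to survive."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                # known candidate: increment
        elif len(counters) < m:
            counters[x] = 1                 # room left: start a new counter
        else:
            for key in list(counters):      # no room: decrement every counter
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters                         # candidates with approximate counts

# With m = 99, any word making up more than 1% of the queries is still present at the end.
```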
Going back to the original straightforward method of appending to a huge array: what if we just removed the most infrequent elements every once in a while? This gives very good results, but we still have the unbounded-space problem. It (along with majority rule) illustrates that we will not get the correct answer 100% of the time if we must obey the constant-space rule. But is that really all that bad?
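One way to picture the pruning idea as a sketch; the pruning interval and the cutoff below are arbitrary choices for illustration, not anything from the article:

```python
def prune_and_count(stream, prune_every=10_000, min_count=2):
    """Count words normally, but periodically throw away the rare ones."""
    counts, seen = {}, 0
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        seen += 1
        if seen % prune_every == 0:
            # drop the infrequent entries; popular words keep their counts
            counts = {w: c for w, c in counts.items() if c >= min_count}
    return counts
```

Between prunings the table can still balloon, and a word that is rare early but popular later loses its early occurrences, which is why this yields good but not guaranteed-exact answers.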
A uniform random distribution has known, expected statistical properties (much like the standard normal distribution). A technique used in computer science called "hashing" essentially scrambles and bins values that come from an unpredictable distribution, making them appear as if they were uniformly distributed. The bins can then be analyzed statistically to make generalizations about the data stream.
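A tiny sketch of that idea, with an arbitrary choice of hash function (MD5) and bin count: even though the raw query strings are anything but uniform, their hashed bin assignments come out roughly uniform.

```python
import hashlib
from collections import Counter

def bin_of(value, num_bins=8):
    """Scramble a value into one of num_bins buckets via a hash function."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

# The raw values are clumpy, structured strings...
queries = [f"dog ate chocolate {i}" for i in range(10_000)]
print(Counter(bin_of(q) for q in queries))
# ...yet they land in the 8 bins in roughly equal proportions,
# so the bins can be reasoned about with ordinary statistics.
```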
You'll always be Number 1 in my book, even though the '90s miss you.
Reference: Hayes, Brian. "The Britney Spears problem." American Scientist 96.4 (2008): 274.