Presentation transcript:

Review William Cohen

Sessions
Randomized algorithms
– The "hash trick"
– Bloom filters
  how would you use it? how and why does it work?
– Count-min sketch
  applications (event counting for language models, modeling label distributions for graph-based SSL, …)
– LSH and random projections
  on-line LSH; pooling and LSH

Bloom filters
Interface to a Bloom filter
– BloomFilter(int maxSize, double p);
– void bf.add(String s);      // insert s
– bool bf.contains(String s);
  // if s was added, return true;
  // else, with probability at least 1-p, return false;
  // else, with probability at most p, return true
– I.e., a noisy "set" where you can test membership (and that's it)
(don't memorize this)

Bloom filters
An implementation
– Allocate M bits, bit[0],…,bit[M-1]
– Pick K hash functions hash(1,s), hash(2,s), …
  E.g.: hash(i,s) = hash(s + randomString[i])
– To add string s:
  For i = 1 to K, set bit[hash(i,s)] = 1
– To check contains(s):
  For i = 1 to K, test bit[hash(i,s)]
  Return "true" if they're all set; otherwise, return "false"
– We'll discuss how to set M and K soon, but for now:
  Let M = 1.5*maxSize    // less than two bits per item!
  Let K = 2*log(1/p)     // about right with this M
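A minimal Python sketch of this scheme follows. It is not the slides' implementation: the salted SHA-256 construction merely stands in for the hash(s + randomString[i]) idea, and the M and K choices just follow the slide's rule of thumb.

    import hashlib
    import math

    class BloomFilter:
        """Minimal Bloom filter sketch: M bits, K salted hash functions."""

        def __init__(self, max_size, p):
            # Parameter choices follow the slide's rough rule of thumb.
            self.m = int(1.5 * max_size)                # number of bits
            self.k = max(1, int(2 * math.log(1 / p)))   # number of hash functions
            self.bits = [False] * self.m

        def _hash(self, i, s):
            # hash(i, s) = hash(s + randomString[i]); here the "random string"
            # is just the index i used as a salt.
            h = hashlib.sha256((s + "#" + str(i)).encode()).hexdigest()
            return int(h, 16) % self.m

        def add(self, s):
            for i in range(self.k):
                self.bits[self._hash(i, s)] = True

        def contains(self, s):
            return all(self.bits[self._hash(i, s)] for i in range(self.k))

    bf = BloomFilter(max_size=1000, p=0.01)
    bf.add("fred flintstone")
    print(bf.contains("fred flintstone"))   # True
    print(bf.contains("barney rubble"))     # False with high probability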

Bloom filters – example (figure): bf.add("fred flintstone") sets the bits chosen by hash functions h1, h2, h3; bf.add("barney rubble") sets the bits chosen by h1, h2, h3 for that string.

Bloom filters – example (figure): bf.contains("fred flintstone") and bf.contains("barney rubble") test the bits chosen by h1, h2, h3 and answer "true" only if all of them are set.

Bloom filters
Analysis (n items inserted, m bits, k hash functions):
– Pr(collision) ≈ (1 - e^(-k*n/m))^k
– Plug the optimal k = (m/n)*ln(2) back into Pr(collision):
  p ≈ (1/2)^k = 2^(-(m/n)*ln(2)) ≈ 0.6185^(m/n)
– Now we can fix any two of p, n, m and solve for the 3rd.
  E.g., the value for m in terms of n and p:
  m = -n*ln(p) / (ln 2)^2 ≈ 1.44*n*log2(1/p)
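As a quick numeric check on these formulas, here is a small sketch that sizes a filter from n and p; the example numbers are illustrative, not from the slides.

    import math

    def bloom_parameters(n, p):
        """Size a Bloom filter for n items and target false-positive rate p."""
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # number of bits
        k = round((m / n) * math.log(2))                       # number of hash functions
        return m, k

    # e.g. 1 million items at a 1% false-positive rate
    m, k = bloom_parameters(1_000_000, 0.01)
    print(m, k)   # roughly 9.6 million bits (~1.2 MB) and 7 hash functions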

Bloom filters
An example application: finding items in "sharded" data
– Easy if you know the sharding rule
– Harder if you don't (like Google n-grams)
Simple idea:
– Build a BF of the contents of each shard
– To look for key, load in the BFs one by one, and search only the shards that probably contain key
– Analysis: you won't miss anything, but you might look in some extra shards
– You'll hit O(1) extra shards if you set p = 1/#shards
(Another "how would you use it?" question)
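A sketch of that lookup pattern, assuming the BloomFilter class sketched earlier is in scope; the shard layout and the search step are illustrative assumptions.

    # Build one Bloom filter per shard (done once, e.g. at indexing time).
    def build_shard_filters(shards, p):
        """shards: dict mapping shard name -> iterable of keys stored in that shard."""
        filters = {}
        for name, keys in shards.items():
            keys = list(keys)
            bf = BloomFilter(max_size=len(keys), p=p)
            for key in keys:
                bf.add(key)
            filters[name] = bf
        return filters

    def candidate_shards(filters, key):
        # Only shards whose filter says "maybe" need to be searched; false
        # positives cost extra shard lookups, but no shard holding the key is skipped.
        return [name for name, bf in filters.items() if bf.contains(key)]

    # Usage: with p = 1/len(filters), we expect O(1) extra shards per lookup.
    # for shard_name in candidate_shards(filters, "some key"):
    #     search(shard_name, "some key")   # search() is a hypothetical per-shard lookup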

Bloom filter → Count-min sketch
An implementation
– Allocate a matrix CM with d rows, w columns
– Pick d hash functions h1(s), h2(s), …
– To increment counter A[s] for s by c:
  For i = 1 to d, set CM[i, hash(i,s)] += c
– To retrieve the value of A[s]:
  For i = 1 to d, retrieve CM[i, hash(i,s)]
  Return the minimum of these values
– Similar idea to a Bloom filter: if all d counters suffer collisions, you return a value that's too large; otherwise, you return the correct value.
Question: what does this look like if d = 1?
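A minimal Python sketch of this structure, again using salted SHA-256 as a stand-in hash family; the d and w values in the usage example are illustrative.

    import hashlib

    class CountMinSketch:
        """Minimal count-min sketch: d rows of w counters, one hash per row."""

        def __init__(self, d, w):
            self.d, self.w = d, w
            self.cm = [[0] * w for _ in range(d)]

        def _hash(self, i, s):
            h = hashlib.sha256((s + "#" + str(i)).encode()).hexdigest()
            return int(h, 16) % self.w

        def increment(self, s, c=1):
            for i in range(self.d):
                self.cm[i][self._hash(i, s)] += c

        def estimate(self, s):
            # Each row can only overestimate (collisions add counts), so take the minimum.
            return min(self.cm[i][self._hash(i, s)] for i in range(self.d))

    cms = CountMinSketch(d=5, w=1000)
    for word in ["the", "the", "cat", "the"]:
        cms.increment(word)
    print(cms.estimate("the"))   # 3 (never an underestimate)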

Key idea: Random projections (figure). To make those points project "close" together, we would need to project onto a direction orthogonal to the line between them.

Key idea: Random projections (figure, projection directions u and -u). Any other direction will keep the distant points distant. So if I pick a random r, and r·x and r·x' are closer than γ, then probably x and x' were close to start with.

LSH: key ideas – building LSH from RP
Goal:
– map feature vector x to bit vector bx
– ensure that bx preserves "similarity"
Basic idea: use random projections of x
– Repeat many times:
  Pick a random hyperplane r by picking random weights for each feature (say, from a Gaussian)
  Compute the inner product of r with x
  Record whether x is "close to" r (r·x >= 0) – this is the next bit in bx
– Theory says that if x and x' have small cosine distance, then bx and bx' will have small Hamming distance
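A small NumPy sketch of signed random projections along these lines; the number of bits, the seed, and the test vectors are illustrative assumptions.

    import numpy as np

    def lsh_signature(x, hyperplanes):
        """Signed random projections: one bit per hyperplane (1 if r.x >= 0)."""
        return (hyperplanes @ x >= 0).astype(int)

    rng = np.random.default_rng(0)
    n_bits, n_features = 64, 100
    hyperplanes = rng.normal(size=(n_bits, n_features))   # random Gaussian weights

    x = rng.normal(size=n_features)
    x_near = x + 0.01 * rng.normal(size=n_features)   # small cosine distance to x
    x_far = rng.normal(size=n_features)               # unrelated vector

    bx, bx_near, bx_far = (lsh_signature(v, hyperplanes) for v in (x, x_near, x_far))
    print(np.sum(bx != bx_near))   # small Hamming distance
    print(np.sum(bx != bx_far))    # roughly n_bits / 2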

Sessions
Randomized algorithms
Graph algorithms
– Scalable PageRank – an example of the consequences of scaling up a graph computation
– PageRank-Nibble (and use of approximate PageRank)
  what did you do? what was the motivation for the assignment?
– Graph-based SSL (HF, MRW, MAD) and graph-based unsupervised learning (spectral methods and PIC) – examples of using a graph computation for ML
  what's the API for these algorithms? how do they work? could you implement them? what are the main differences?
– Graph models for computation (Pregel, signal-collect, GraphLab, PowerGraph, GraphX, GraphChi)

Sessions
Randomized algorithms
Graph algorithms
– Scalable PageRank
– Graph-based SSL
– Graph models for computation
Generative models (e.g., LDA)
– sampling costs and ways to speed up sampling
– parallelization approaches (parameter server, IPM)
– the space of generative models: can you extrapolate what you've learned about LDA to models for graphs, etc.?