Small Summaries for Big Data


Small Summaries for Big Data Ke Yi HKUST yike@ust.hk

The case for “Big Data” in one slide
“Big” data arises in many forms:
Medical data: genetic sequences, time series
Activity data: GPS location, social network activity
Business data: customer behavior tracked at fine detail
Physical measurements: from science (physics, astronomy)
The 3 V’s: Volume, Velocity, Variety (the focus of this talk)

Computational scalability
The first (prevailing) approach: scale up the computation.
Many great technical ideas:
Use many cheap commodity devices
Accept and tolerate failure
Move code to data
MapReduce: BSP for programmers
Break the problem into many small pieces
Decide which constraints to drop: noSQL, ACID, CAP
Scaling up comes with its disadvantages: expensive (hardware, equipment, energy), and still not always fast.
This talk is not about this approach!

Downsizing data
A second approach to computational scalability: scale down the data!
A compact representation of a large data set
There is too much redundancy in big data anyway
What we finally want is small: human-readable analysis / decisions
Necessarily gives up some accuracy: approximate answers
Often randomized (small constant probability of error)
Examples: samples, sketches, histograms, wavelet transforms
Complementary to the first approach: not a case of either-or
Some drawbacks:
Not a general-purpose approach: the summary needs to fit the problem
Some computations don’t allow any useful summary

Outline for the talk
Some of the most important data summaries:
Sketches: Bloom filter, Count-Min, Count-Sketch (AMS)
Counters: heavy hitters, quantiles
Sampling: simple samples, distinct sampling
Summaries for more complex objects: graphs and matrices
Current trends and future challenges for data summarization
Many topics are abbreviated or omitted (histograms, wavelets, ...)
A lot of work is relevant to compact summaries, in both the database and the algorithms literature

Summary Construction
There are several different models for summary construction:
Offline computation: e.g. sort the data, take percentiles
Streaming: the summary is merged with one new item at each step
Full mergeability: allows arbitrary merges of partial summaries; the most general and widely applicable category
Key methods for summaries:
Create an empty summary
Update with one new tuple (insert or delete): streaming processing
Merge summaries together: distributed processing (e.g. MapReduce)
Query: may tolerate some approximation (parameterized by ε)
Several important cost metrics (as functions of ε and n): size of the summary, time cost of each operation
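As a sketch of what this create/update/merge/query interface looks like in code (a minimal, hypothetical Python skeleton; the class and method names are illustrative, not from the talk):

```python
from abc import ABC, abstractmethod

class Summary(ABC):
    """Generic mergeable summary: create, update, merge, query."""
    def __init__(self, eps: float):
        self.eps = eps                          # approximation parameter

    @abstractmethod
    def update(self, item):                     # streaming: absorb one new tuple
        ...

    @abstractmethod
    def merge(self, other: "Summary"):          # distributed: combine partial summaries
        ...

    @abstractmethod
    def query(self, *args):                     # typically approximate, within eps
        ...
```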

Sketches

Bloom Filters [Bloom 1970]
Bloom filters compactly encode set membership, e.g. storing a list of many long URLs compactly.
k random hash functions map each item to k positions in an m-bit vector.
Update: set all k bits the item maps to, to indicate that it is present.
Query: is x in the set? Return “yes” if all k bits that x maps to are 1.
Duplicate insertions do not change a Bloom filter.
Bloom filters of the same size can be merged by OR-ing their bit vectors.
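A minimal Python sketch of the Bloom filter just described; the salted SHA-1 digests standing in for the k random hash functions are an illustrative assumption:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                      # the m-bit vector

    def _positions(self, x):
        # k hash positions derived from salted digests (illustrative choice)
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def update(self, x):                         # set all k bits for x
        for p in self._positions(x):
            self.bits[p] = 1

    def query(self, x) -> bool:                  # "yes" iff all k bits are set
        return all(self.bits[p] for p in self._positions(x))

    def merge(self, other: "BloomFilter"):       # OR the vectors (same m and k)
        self.bits = [a | b for a, b in zip(self.bits, other.bits)]
```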

Bloom Filters Analysis
If an item is in the set, the filter always returns “yes”: no false negatives.
If an item is not in the set, it may still return “yes” with some probability (a false positive).
Prob. that a given bit is 0: (1 − 1/m)^{kn}
Prob. that all k queried bits are 1: (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k
Setting k = (m/n) ln 2 minimizes this probability, to 0.6185^{m/n}.
For a false-positive probability δ, a Bloom filter needs O(n log(1/δ)) bits.

Bloom Filters Applications
Bloom filters are widely used in “big data” applications; many problems require storing a large set of items.
They can be generalized to allow deletions: swap bits for counters, increment on insert, decrement on delete.
If there are no duplicates, small counters suffice: 4 bits per counter.
Bloom filters are still an active research area and often appear in networking conferences.
Also known as “signature files” in the DB literature.

Invertible Bloom Lookup Table [GM11]
A summary that supports insertions and deletions and, when the number of items is at most t, can retrieve all of them; also known as sparse recovery.
Easy case when t = 1 (assume all items are integers): just keep z = the bitwise XOR of all items.
To insert x: z ← z ⊕ x
To delete x: z ← z ⊕ x
Assumption: no duplicates, and you can’t delete a non-existing item.
A problem: how do we know when the number of items is exactly 1? Just keep another counter.

Invertible Bloom Lookup Table
The structure is the same as a Bloom filter, but each cell stores the XOR of all items mapped to it, together with a counter of those items.
How to recover? Check whether any cell holds exactly 1 item; if so, recover that item, delete it from all of its cells, and repeat.

Invertible Bloom Lookup Table
One can show that, using 1.5t cells, the structure can recover up to t items with high probability.
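A minimal sketch of the cell-plus-peeling idea under the assumptions above (integer items, no duplicates, no deletion of absent items); the hash choice and the number of cells per item are illustrative, not the exact [GM11] construction:

```python
import hashlib

class IBLT:
    def __init__(self, cells: int, k: int = 3):
        self.k = k
        self.count = [0] * cells                 # number of items in each cell
        self.xor = [0] * cells                   # XOR of the items in each cell

    def _positions(self, x: int):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % len(self.count)

    def insert(self, x: int):
        for p in self._positions(x):
            self.count[p] += 1
            self.xor[p] ^= x

    def delete(self, x: int):
        for p in self._positions(x):
            self.count[p] -= 1
            self.xor[p] ^= x

    def recover(self) -> list:
        """Peel cells holding exactly one item until no such cell remains."""
        items, progress = [], True
        while progress:
            progress = False
            for c in range(len(self.count)):
                if self.count[c] == 1:           # a pure cell: exactly one item
                    x = self.xor[c]
                    items.append(x)
                    self.delete(x)               # remove it from all of its cells
                    progress = True
        return items
```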

Count-Min Sketch [CM 04]
The Count-Min sketch encodes item counts, allowing estimation of frequencies (e.g. for selectivity estimation); it has some similarities in appearance to Bloom filters.
Model the input data as a vector x of dimension U; e.g., [0..U] is the space of all IP addresses and x[i] is the frequency of IP address i in the data.
Create a small summary as an array CM[i, j] of size w × d.
Use d hash functions to map vector entries to [1..w].

Count-Min Sketch Structure
The sketch has d rows of width w = 2/ε; row k uses hash function h_k.
Update (i, +c): entry i of the vector x is mapped to one bucket per row, and c is added to CM[k, h_k(i)] for every row k.
Merge two sketches by entry-wise summation.
Query: estimate x[i] by taking min_k CM[k, h_k(i)].
The estimate never underestimates; we will bound the error of overestimation.

Count-Min Sketch Analysis
Focusing on the first row: x[i] is always added to CM[1, h_1(i)].
Each x[j], j ≠ i, is added to CM[1, h_1(i)] with probability 1/w, so its expected contribution to the error is x[j]/w.
The total expected error is Σ_{j≠i} x[j]/w ≤ ‖x‖₁/w.
By Markov’s inequality, Pr[error > 2‖x‖₁/w] < 1/2.
By taking the minimum over d rows, this probability drops to 1/2^d.
To give ε‖x‖₁ error with probability 1 − δ, the sketch needs size (2/ε) × log(1/δ).
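A minimal Count-Min sketch in Python with the update, merge, and min-query described above; the salted-hash rows and the exact sizing constants are illustrative assumptions:

```python
import hashlib
import math

class CountMin:
    def __init__(self, eps: float, delta: float):
        self.w = max(1, math.ceil(2 / eps))                   # width 2/eps
        self.d = max(1, math.ceil(math.log2(1 / delta)))      # log(1/delta) rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, row: int, i) -> int:
        digest = hashlib.sha1(f"{row}:{i}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, i, c: int = 1):                          # add c to item i
        for row in range(self.d):
            self.table[row][self._h(row, i)] += c

    def query(self, i) -> int:                                # never underestimates
        return min(self.table[row][self._h(row, i)] for row in range(self.d))

    def merge(self, other: "CountMin"):                       # entry-wise summation
        for row in range(self.d):
            for b in range(self.w):
                self.table[row][b] += other.table[row][b]
```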

Generalization: Sketch Structures
A sketch is a class of summary that is a linear transform of the input: Sketch(x) = Sx for some matrix S.
Hence Sketch(αx + βy) = α Sketch(x) + β Sketch(y), which makes sketches trivial to update and merge.
S is often described in terms of hash functions, and the analysis relies on properties of those hash functions.
Some analyses assume truly random hash functions (e.g. the Bloom filter); these don’t exist, but ad hoc hash functions work well for much real-world data.
Others only need hash functions with limited independence.

Count Sketch / AMS Sketch [CCF-C02, AMS96]
Same layout as Count-Min: d rows of width w, with bucket hash functions h_1, ..., h_d.
A second hash function per row: g_k maps each i to +1 or −1.
Update (i, +c): entry i of the vector x is mapped to one bucket per row, adding c · g_k(i) to C[k, h_k(i)].
Merge two sketches by entry-wise summation.
Query: estimate x[i] by taking median_k C[k, h_k(i)] · g_k(i).
The estimate may either overestimate or underestimate.

Count Sketch Analysis
Focusing on the first row: x[i] g_1(i) is always added to C[1, h_1(i)], and we return C[1, h_1(i)] g_1(i), which contains the term x[i].
Each x[j] g_1(j), j ≠ i, is added to C[1, h_1(i)] with probability 1/w; its expected contribution to the error is x[j] E[g_1(i) g_1(j)]/w = 0.
So E[total error] = 0, and Var[total error] ≤ ‖x‖₂²/w.
By Chebyshev’s inequality, Pr[|total error| > 2‖x‖₂/√w] < 1/4.
By taking the median of d rows, this probability drops to 1/2^{O(d)}.
To give ε‖x‖₂ error with probability 1 − δ, the sketch needs size O((1/ε²) log(1/δ)).
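A minimal Count Sketch in the same style, adding the ±1 hash and the median estimator; again the salted hashes are an illustrative assumption:

```python
import hashlib
import statistics

class CountSketch:
    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _digest(self, salt: str, row: int, i) -> int:
        return int(hashlib.sha1(f"{salt}:{row}:{i}".encode()).hexdigest(), 16)

    def _h(self, row: int, i) -> int:                 # bucket hash
        return self._digest("h", row, i) % self.w

    def _g(self, row: int, i) -> int:                 # sign hash: +1 or -1
        return 1 if self._digest("g", row, i) % 2 == 0 else -1

    def update(self, i, c: int = 1):
        for row in range(self.d):
            self.table[row][self._h(row, i)] += c * self._g(row, i)

    def query(self, i) -> float:                      # may over- or underestimate
        return statistics.median(
            self.table[row][self._h(row, i)] * self._g(row, i)
            for row in range(self.d)
        )

    def merge(self, other: "CountSketch"):            # entry-wise summation
        for row in range(self.d):
            for b in range(self.w):
                self.table[row][b] += other.table[row][b]
```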

Count-Min Sketch vs Count Sketch
Count-Min sketch: size O(wd), error ‖x‖₁/w.
Count Sketch: size O(wd), error ‖x‖₂/√w.
Other benefits of Count Sketch: it is an unbiased estimator and can also be used to estimate x ⋅ y (join size).
Count Sketch is better when ‖x‖₂ < ‖x‖₁/√w.

Application to Large-Scale Machine Learning
In machine learning we often have a very large feature space: many objects, each with a huge, sparse feature vector. Working in the full feature space is slow and costly.
“Hash kernels”: work with a sketch of the features. Effective in practice! [Weinberger, Dasgupta, Langford, Smola, Attenberg ’09]
A similar analysis explains why: essentially, not too much noise lands on the important features.
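A minimal illustration of the feature-hashing idea (a Count-Sketch-style signed projection of a sparse feature vector); the dimension, feature names, and hash construction are assumptions for illustration:

```python
import hashlib

def hash_features(features: dict, dim: int = 1024) -> list:
    """Map a sparse {feature_name: value} vector to a dense dim-sized vector."""
    v = [0.0] * dim
    for name, value in features.items():
        digest = int(hashlib.sha1(name.encode()).hexdigest(), 16)
        bucket = digest % dim                        # which coordinate it lands in
        sign = 1.0 if (digest >> 64) % 2 == 0 else -1.0
        v[bucket] += sign * value
    return v

# Inner products in the hashed space approximate those in the original space:
x = hash_features({"word:big": 1.0, "word:data": 2.0})
y = hash_features({"word:data": 1.0, "word:small": 3.0})
approx_dot = sum(a * b for a, b in zip(x, y))        # close to the true value 2.0
```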

Counter-Based Summaries

Heavy Hitters [MG82]
The Misra-Gries (MG) algorithm finds up to k items that occur more than a 1/k fraction of the time in a stream, and estimates their frequencies with additive error ≤ N/k.
Equivalently, it achieves εN error with O(1/ε) space: better than Count-Min, but it doesn’t support deletions.
Keep k different candidates in hand. To add a new item:
If the item is already monitored, increase its counter.
Else, if fewer than k items are monitored, add the new item with count 1.
Else, decrease all counts by 1.

MG Analysis
Let N = total input size and M = sum of the counters in the data structure.
The error in any estimated count is at most (N − M)/(k + 1), and each estimated count is a lower bound on the true count.
Each decrement is spread over k + 1 items: the 1 new item and the k items in MG, so it is equivalent to deleting k + 1 distinct items from the stream.
Hence there are at most (N − M)/(k + 1) decrement operations, so at most (N − M)/(k + 1) copies of any item can have been “deleted”, and the estimated counts have at most this much error.
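A minimal sketch of the Misra-Gries update rule described above, keeping at most k candidate counters in a dictionary:

```python
def misra_gries_update(counters: dict, item, k: int) -> None:
    """Process one stream item, keeping at most k monitored candidates."""
    if item in counters:
        counters[item] += 1                     # already monitored: increment
    elif len(counters) < k:
        counters[item] = 1                      # room left: start monitoring it
    else:
        for key in list(counters):              # no room: decrement everyone
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

# Estimated counts are lower bounds with additive error at most (N - M)/(k + 1).
counters = {}
for x in ["a", "b", "a", "c", "a", "d", "a", "b", "e"]:
    misra_gries_update(counters, x, k=2)
```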

Merging two MG summaries [ACHPZY12]
Merging algorithm:
Merge the two sets of k counters in the obvious way (adding the counts of common items).
Take the (k+1)-th largest counter, C_{k+1}, and subtract it from all counters.
Delete non-positive counters.

Merging two MG summaries
This algorithm gives mergeability:
The merge subtracts at least (k+1) C_{k+1} from the counter sums, so (k+1) C_{k+1} ≤ M1 + M2 − M12, where M12 is the sum of the remaining (at most k) counters.
By induction, the error is ((N1 − M1) + (N2 − M2) + (M1 + M2 − M12))/(k + 1) = ((N1 + N2) − M12)/(k + 1), where the first two terms are the prior error and the third comes from the merge.
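A minimal sketch of the merge just described, operating on the counter dictionaries from the update sketch above:

```python
from collections import Counter

def misra_gries_merge(c1: dict, c2: dict, k: int) -> dict:
    """Merge two MG summaries with parameter k, as in [ACHPZY12]."""
    merged = Counter(c1) + Counter(c2)               # add counts of common items
    if len(merged) <= k:
        return dict(merged)
    ck1 = sorted(merged.values(), reverse=True)[k]   # (k+1)-th largest counter
    # Subtract it from all counters and drop the non-positive ones.
    return {item: cnt - ck1 for item, cnt in merged.items() if cnt > ck1}
```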

Quantiles (order statistics)
Exact quantiles: F^{−1}(φ) for 0 < φ < 1, where F is the CDF.
Approximate version: tolerate any answer between F^{−1}(φ − ε) and F^{−1}(φ + ε).
Dual problem, rank estimation: given x, estimate F(x). Rank estimates can be combined with binary search to find quantiles.

Quantiles give an equi-height histogram, which automatically adapts to skewed data distributions.
Equi-width histograms (fixed binning) are easy to construct but do not adapt to the data distribution.

The GK Quantile Summary [GK01]
Each node stores a value and its rank (more precisely, bounds on its rank).
Insert 3 (N = 1), then 14 (N = 2), then 50 (N = 3): while every element is kept we know the exact rank of any element, but the space is O(#nodes).
Insert 9 (N = 4) in the middle: an insertion in the middle shifts the ranks of later nodes, but does not introduce errors.
Deletion introduces errors: after deleting the node for 14, is rank(13) 2 or 3? So nodes keep rank bounds (e.g. 2-3) instead of exact ranks; subsequent insertions do not introduce new error, and the error stays within the interval bounds.
One comes, one goes: new elements are inserted immediately, and after each insertion we try to delete one node. Removing a node causes error, so delete the node with the smallest error, and only if that error is smaller than εN.
Query, e.g. rank(20): answer from the surviving nodes (here, return rank(20) = 4); the error is at most 1 in this example, and at most εN in general.

The GK Quantile Summary
The practical version: its space bound is an open problem, but it is small in practice.
A more complicated version achieves O((1/ε) log(εN)) space.
It doesn’t support deletions.
It doesn’t support merging (we don’t know how to).
There is a randomized mergeable quantile summary.

A Quantile Summary Supporting Deletions [WLYC13]
Assumption: all elements are integers from [0..U].
Use a dyadic decomposition: at each level, counters record how many elements fall in each dyadic interval (e.g., one counter for the number of elements between U/2 and U−1).

A Quantile Summary Supporting Deletions
rank(x) is obtained by summing the counts of the dyadic intervals that cover [0, x): at most O(log U) intervals are needed.

A Quantile Summary Supporting Deletions
Consider each layer as a frequency vector and build a sketch of it.
Count Sketch is better than Count-Min here, due to its unbiasedness.
Space: O((1/ε) log^{1.5} U)
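A minimal sketch of the dyadic rank-query idea, assuming integer elements in [0..U) with U a power of two; for clarity each level keeps exact counts in a dictionary, where the summary of [WLYC13] would keep a Count Sketch per level instead:

```python
from collections import defaultdict

class DyadicRank:
    """Rank queries over [0..U) via a dyadic decomposition (exact counts per level)."""
    def __init__(self, log_u: int):
        self.log_u = log_u                                   # U = 2 ** log_u
        self.levels = [defaultdict(int) for _ in range(log_u + 1)]

    def update(self, x: int, c: int = 1):                    # c = -1 deletes a copy
        for level in range(self.log_u + 1):
            self.levels[level][x >> level] += c

    def rank(self, x: int) -> int:
        """Number of elements < x: the sum of O(log U) dyadic interval counts."""
        r, lo = 0, 0
        for level in reversed(range(self.log_u + 1)):
            width = 1 << level
            if lo + width <= x:                              # a whole dyadic block below x
                r += self.levels[level][lo >> level]
                lo += width
        return r
```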

Random Sampling

Random Sample as a Summary
Cost: a random sample is cheaper to build if you have random access to the data.
Error-size tradeoff: specialized summaries often have size O(1/ε), while a random sample needs size O(1/ε²).
Generality: specialized summaries can only answer one type of query; a random sample can be used to answer different queries, in particular ad hoc queries that are unknown beforehand.

Sampling Definitions
Sampling without replacement: randomly draw an element, don’t put it back; repeat s times.
Sampling with replacement: randomly draw an element, put it back; repeat s times.
Coin-flip sampling: sample each element with probability p; the sample size changes for dynamic data.
The statistical difference between these is very small when N ≫ s ≫ 1.

Reservoir Sampling
Given a sample of size s drawn (without replacement) from a data set of size N, how do we update the sample when a new element is added?
Algorithm:
N ← N + 1
With probability s/N, use the new element to replace an item in the current sample, chosen uniformly at random.
With probability 1 − s/N, throw it away.
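A minimal sketch of the reservoir update just described (filling the first s slots directly is the usual convention and is assumed here):

```python
import random

def reservoir_update(sample: list, n: int, item, s: int) -> int:
    """Maintain a uniform size-s sample without replacement; returns the new n."""
    n += 1
    if len(sample) < s:
        sample.append(item)                    # the first s items are always kept
    elif random.random() < s / n:
        sample[random.randrange(s)] = item     # replace a uniformly random slot
    return n

# Example: stream the integers 0..9999 through a reservoir of size 5.
sample, n = [], 0
for x in range(10_000):
    n = reservoir_update(sample, n, x, s=5)
```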

Reservoir Sampling Correctness Proof
Many “proofs” found online are actually wrong: they only show that each item is sampled with probability s/N.
One needs to show that every subset of size s has the same probability of being the sample.
The correct proof relates to the Fisher-Yates shuffle (illustrated on the slide with s = 2 over items a, b, c, d).

Merging random samples
To merge S1 (a without-replacement sample from a set of size N1) with S2 (from a set of size N2): with probability N1/(N1 + N2), move a random item from S1 into the merged sample and decrement N1; with probability N2/(N1 + N2), move a random item from S2 and decrement N2.
Repeat until s items have been drawn; the result is a without-replacement sample of size s from the union of N3 = N1 + N2 elements (10 + 15 = 25 in the example).
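A minimal sketch of this merge procedure, assuming both inputs are without-replacement samples of size at least s:

```python
import random

def merge_samples(s1: list, n1: int, s2: list, n2: int, s: int) -> list:
    """Merge two without-replacement samples into one sample of size s."""
    s1, s2 = list(s1), list(s2)                # work on copies
    merged = []
    while len(merged) < s:
        if random.random() < n1 / (n1 + n2):
            merged.append(s1.pop(random.randrange(len(s1))))
            n1 -= 1                            # one fewer candidate on side 1
        else:
            merged.append(s2.pop(random.randrange(len(s2))))
            n2 -= 1
    return merged
```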

Random Sampling with Deletions
Naïve method: maintain a sample of size M > s and handle insertions as in reservoir sampling; if a deleted item is not in the sample, ignore it, and if it is, delete it from the sample.
Problem: the sample size may drop below s.
There are some other algorithms in the DB literature, but the sample size still drops when there are many deletions.

Min-wise Hashing / Sampling
For each item x, compute h(x); return the s items with the smallest values of h(x) as the sample.
Assumption: h is a truly random hash function; there is a lot of research on how to remove this assumption.
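A minimal bottom-s min-wise sample, using a fixed SHA-1-based hash as a stand-in for the truly random h assumed on the slide:

```python
import hashlib
import heapq

def minhash(x) -> int:
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16)

def minwise_sample(items, s: int) -> list:
    """Return the s distinct items with the smallest hash values."""
    return heapq.nsmallest(s, set(items), key=minhash)
```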

L0-Sampling [FIS05, CMR05]
Suppose h(x) ∈ [0..u−1] and always has log u bits (pad with zeroes when necessary).
Map all items to level 0.
Map items whose h(x) has one leading zero to level 1, items with two leading zeroes to level 2, and so on, so an item reaches level j with probability 2^{−j} (from p = 1 at the bottom level down to p = 1/u at the top).
Use a 2s-sparse recovery summary for each level.

Estimating Set Similarity with MinHash
Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|.
Assumption: the sample from A and the sample from B are obtained with the same hash function h.
Pr[(an item sampled uniformly from A ∪ B) ∈ A ∩ B] = J(A, B).
To estimate J(A, B): merge Sample(A) and Sample(B) to get Sample(A ∪ B) of size s, count how many items in Sample(A ∪ B) are in A ∩ B, and divide by s.
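A minimal sketch of this estimator, built on bottom-s min-wise samples; it uses the fact that an item of Sample(A ∪ B) lies in A ∩ B exactly when it appears in both Sample(A) and Sample(B):

```python
import hashlib
import heapq

def jaccard_estimate(sample_a: list, sample_b: list, s: int) -> float:
    """Estimate J(A, B) from bottom-s min-wise samples taken with the same hash."""
    def h(x) -> int:
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
    # The s smallest hash values in Sample(A) ∪ Sample(B) form Sample(A ∪ B).
    union_sample = heapq.nsmallest(s, set(sample_a) | set(sample_b), key=h)
    in_a, in_b = set(sample_a), set(sample_b)
    # An item of Sample(A ∪ B) is in A ∩ B iff it appears in both input samples.
    hits = sum(1 for x in union_sample if x in in_a and x in in_b)
    return hits / s
```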

Advanced Summaries

Geometric data: -approximations A sample that preserves “density” of point sets For any range (e.g., a circle), |fraction of sample point − fraction of all points|≤𝜀 Can simply draw a random sample of size 𝑂(1/ 𝜀 2 ) More careful construction yields size 𝑂(1/ 𝜀 2𝑑/(𝑑+1) ) Summarizing Disitributed Data

Geometric data: -kernels -kernels approximately preserve the convex hull of points -kernel has size O(1/(d-1)/2) Streaming -kernel has size O(1/(d-1)/2 log(1/)) Summarizing Disitributed Data

Graph Sketching [AGM12]
Goal: build a sketch for each vertex and use it to answer queries.
Connectivity: we want to test whether there is a path between any two nodes. This is trivial if edges can only be inserted; we want to support deletions as well.
Basic idea: repeatedly contract edges between components.
Use L0 sampling to get edges from each node’s vector of adjacencies; the L0 sampling sketch supports deletions and merges.
Problem: as components grow, sampling edges from a component is most likely to produce internal links.

Graph Sketching
Idea: use a clever encoding of edges.
For an edge (i, j) with i < j, node i encodes it as ((i, j), +1) and node j encodes it as ((i, j), −1).
When node i and node j get merged, sum their L0 sketches: the contribution of edge (i, j) exactly cancels out, so only non-internal edges remain in the L0 sketches.
Use independent sketches for each iteration of the algorithm; only O(log n) rounds are needed with high probability.
Result: O(poly-log n) space per node for connectivity.
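A tiny illustration of the cancellation property above, using exact signed edge-incidence vectors in place of L0 sketches (the sketches are linear, so they behave the same way under summation); the example graph is an illustrative assumption:

```python
from collections import Counter

def node_vector(node: int, edges: list) -> Counter:
    """Signed edge-incidence vector: +1 if the node is the smaller endpoint, -1 otherwise."""
    v = Counter()
    for i, j in edges:                       # edges are given as (i, j) with i < j
        if node == i:
            v[(i, j)] += 1
        elif node == j:
            v[(i, j)] -= 1
    return v

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]     # a small example graph
component = [1, 2, 3]                        # merge the vectors of nodes 1, 2, 3
total = sum((node_vector(u, edges) for u in component), Counter())
# Internal edges (1,2), (2,3), (1,3) cancel; only the boundary edge (3,4) survives.
print(+total)                                # Counter({(3, 4): 1})
```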

Other Graph Results via Sketching
There has been a recent flurry of activity in summaries for graph problems:
k-connectivity via connectivity
Bipartiteness via connectivity
(Weight of the) minimum spanning tree
Sparsification: find G' with few edges so that cut(G, C) ≈ cut(G', C)
Matching: find a maximal matching (assuming it is small)
The total cost is typically O(|V|) rather than O(|E|): the semi-streaming / semi-external model.

Matrix Sketching
Given matrices A and B, we want to approximate the matrix product AB, measuring the normed error of the approximation C: ‖AB − C‖.
The main results are for the Frobenius (entrywise) norm ‖·‖_F, where ‖C‖_F = (Σ_{i,j} C_{i,j}²)^{1/2}.
The results rely on sketches, so this entrywise norm is the most natural.

Direct Application of Sketches
Build an AMS sketch of each row A_i of A and of each column B_j of B.
Estimate C_{i,j} by estimating the inner product of A_i with B_j; the absolute error of the estimate is ε‖A_i‖₂ ‖B_j‖₂ (whp).
Summing the squared errors over all entries of the matrix gives a total error of ε‖A‖_F ‖B‖_F.
This outline was formalized and improved by Clarkson & Woodruff [09, 13], who improved the running time to linear in the number of non-zeros of A and B.
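A minimal sketch of the AMS (tug-of-war) inner-product estimator behind this approach: each vector is summarized by a few random ±1 projections, and the product of corresponding projections is an unbiased estimate of the inner product. The number of repetitions and the hash construction are illustrative assumptions:

```python
import hashlib
import statistics

def sign(rep: int, t: int) -> int:
    """A pseudo-random +/-1 value g_rep(t), derived from a salted hash."""
    digest = int(hashlib.sha1(f"{rep}:{t}".encode()).hexdigest(), 16)
    return 1 if digest % 2 == 0 else -1

def ams_sketch(v: list, reps: int) -> list:
    """One signed sum per repetition: z_r = sum_t g_r(t) * v[t]."""
    return [sum(sign(r, t) * v[t] for t in range(len(v))) for r in range(reps)]

def inner_product_estimate(sk_x: list, sk_y: list) -> float:
    """E[z_r(x) * z_r(y)] = <x, y>; take the median over repetitions."""
    return statistics.median(a * b for a, b in zip(sk_x, sk_y))

# Example: estimate one entry C[i][j] = <row A_i, column B_j> from sketches alone.
a_i = [1.0, 0.0, 2.0, 0.0, 3.0]
b_j = [0.0, 1.0, 1.0, 0.0, 2.0]
est = inner_product_estimate(ams_sketch(a_i, reps=9), ams_sketch(b_j, reps=9))
```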

Compressed Matrix Multiplication
What if we are just interested in the large entries of AB, or in the ability to estimate any entry of AB? This arises in recommender systems and other ML applications; if we had a sketch of AB, we could find these approximately.
Compressed matrix multiplication [Pagh 12]: can we compute sketch(AB) from sketch(A) and sketch(B)? To do this, we need to dive into the structure of the Count (AMS) sketch.
Several insights are needed to build the method:
Express the matrix product as a summation of outer products.
Take the convolution of sketches to get a sketch of each outer product.
A new hash function enables this to go through.
Use the FFT to speed up from O(w²) to O(w log w).

More Linear Algebra
Matrix multiplication improvement: use more powerful hash functions to obtain a single accurate estimate with high probability.
Linear regression: given a matrix A and a vector b, find x ∈ R^d to (approximately) solve min_x ‖Ax − b‖. Approach: solve the minimization in “sketch space”, from a summary of size O(d²/ε), independent of the number of rows of A.
Frequent directions [Ghashami, Liberty, Phillips, Woodruff 15]: approximate matrix products; uses the SVD to (incrementally) summarize matrices.
The relevant sketches can be built quickly: in time proportional to the number of non-zeros of the matrices (input sparsity).
Survey: Sketching as a Tool for Numerical Linear Algebra [Woodruff 14].

Current Directions in Data Summarization
Sparse representations of high-dimensional objects: compressed sensing, sparse fast Fourier transform.
General-purpose numerical linear algebra for (large) matrices: k-rank approximation, linear regression, PCA, SVD, eigenvalues.
Summaries to verify a full calculation: a “checksum for computation”.
Geometric (big) data: coresets, clustering, machine learning.
Use of summaries in large-scale, distributed computation: build them in MapReduce and continuous distributed models.
Communication-efficient maintenance of summaries as the (distributed) input is modified.

Small Summary of Small Summaries
Two complementary approaches in response to growing data sizes: scale the computation up, or scale the data down.
The theory and practice of data summarization has many guises: sampling theory (since the start of statistics), streaming algorithms in computer science, and compressive sampling, dimensionality reduction, ... (maths, stats, CS).
There is continuing interest in applying and developing new theory.

Thank you!