Matrix Sketching over Sliding Windows
Zhewei Wei¹, Xuancheng Liu¹, Feifei Li², Shuo Shang¹, Xiaoyong Du¹, Ji-Rong Wen¹
¹ School of Information, Renmin University of China
² School of Computing, The University of Utah

I assume that you are not familiar with the notion of matrix sketching. Given the time limit, this talk will focus on problem definitions and motivations, and skip all the technical details.

Matrix data
Modern data sets are modeled as large matrices.
Think of $A \in \mathbb{R}^{n \times d}$ as $n$ rows in $\mathbb{R}^d$.

| Data | Rows | Columns | $d$ | $n$ |
|---|---|---|---|---|
| Textual | Documents | Words | $10^5$ – $10^7$ | $> 10^{10}$ |
| Actions | Users | Types | $10^1$ – $10^4$ | $> 10^7$ |
| Visual | Images | Pixels, SIFT | $10^5$ – $10^6$ | $> 10^8$ |
| Audio | Songs, tracks | Frequencies | | |
| Machine Learning | Examples | Features | $10^2$ – $10^4$ | $> 10^6$ |
| Financial | Prices | Items, Stocks | $10^3$ – $10^5$ | |

Let's talk about matrix data. Many modern data sets are now modeled as large matrices. We think of these matrices as $n$ rows in $d$ dimensions. Take text data for example: each row of the matrix $A$ represents a document, and each column represents a word. In the bag-of-words model, entry $(i, j)$ is 1 if and only if document $i$ contains word $j$. In these matrices the dimension $d$ is large, and the number of rows $n$ is even larger, so this is sort of a skinny matrix.

Singular Value Decomposition (SVD)
$A = U \Sigma V^T$, where $A \in \mathbb{R}^{n \times d}$, $U \in \mathbb{R}^{n \times n}$, $\Sigma$ has the singular values $\delta_1, \delta_2, \dots, \delta_d$ on its diagonal, and $V \in \mathbb{R}^{d \times d}$.
Principal component analysis (PCA)
K-means clustering
Latent semantic indexing (LSI)

One of the most commonly used techniques for analyzing such matrices is the singular value decomposition.

SVD & Eigenvalue decomposition
Covariance matrix: $A^T A \in \mathbb{R}^{d \times d}$.
$A^T A = V \Sigma^2 V^T$, where $\Sigma^2 = \mathrm{diag}(\delta_1^2, \delta_2^2, \dots, \delta_d^2)$.

A typical way to compute the SVD is to compute $A^T A$, which is a $d \times d$ square matrix, and use power iteration techniques. Essentially, this works because the eigenvalue decomposition of $A^T A$ is $V \Sigma^2 V^T$. So if we want to maintain the SVD or a low rank approximation of $A$, we only need to maintain $A^T A$.

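
The identity used in the notes follows in one line from the SVD on the previous slide. The only fact needed is that $U$ is orthogonal, so $U^T U = I$; here $\Sigma^2$ is shorthand for $\Sigma^T \Sigma$, the $d \times d$ diagonal matrix of squared singular values.

```latex
A^T A = (U \Sigma V^T)^T (U \Sigma V^T)
      = V \Sigma^T U^T U \Sigma V^T
      = V \Sigma^T \Sigma V^T
      = V \Sigma^2 V^T
```
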
Matrix Sketching
Computing SVD is slow (and offline).
Matrix sketching: approximate a large matrix $A \in \mathbb{R}^{n \times d}$ with $B \in \mathbb{R}^{\ell \times d}$, $\ell \ll n$, in an online fashion.
Row-update stream: each update receives a row $a_i$.
Covariance error [Liberty2013, Ghashami2014, Woodruff2016]: $\|A^T A - B^T B\| / \|A\|_F^2 \le \varepsilon$.
Feature hashing [Weinberger2009], random projection [Papadimitriou2011], …
Frequent Directions (FD) [Liberty2013]: $B \in \mathbb{R}^{\ell \times d}$, $\ell = \frac{1}{\varepsilon}$, s.t. covariance error $\le \varepsilon$.

Now let's talk about matrix sketching. The motivation is that computing the SVD is slow, and offline. By online, we mean that matrix sketching works on row-update streams. By approximation, we mean that the covariance error between $A$ and $B$ must be small.

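
A minimal sketch of Frequent Directions on a row-update stream, to make the covariance error bound concrete. The function names (`frequent_directions`, `shrink`, `covariance_error`) are hypothetical, the norm in the numerator is assumed to be the spectral norm, and the code uses the common variant that buffers $2\ell$ rows and shrinks by the $(\ell+1)$-th squared singular value. It is an illustration, not the authors' implementation.

```python
import numpy as np

def shrink(B, ell):
    """Subtract the (ell+1)-th squared singular value from all of them.
    Afterwards every row of the result with index >= ell is zero."""
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_new = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    out = np.zeros_like(B)
    out[:len(s)] = s_new[:, None] * Vt
    return out

def frequent_directions(rows, ell):
    """Process rows one at a time; return a sketch B with ell rows."""
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    B = np.zeros((2 * ell, d))   # working buffer of 2*ell rows
    next_free = 0
    for a in rows:
        if next_free == 2 * ell: # buffer full: shrink frees the bottom half
            B = shrink(B, ell)
            next_free = ell
        B[next_free] = a
        next_free += 1
    return shrink(B, ell)[:ell]

def covariance_error(A, B):
    """||A^T A - B^T B||_2 / ||A||_F^2 (spectral norm assumed)."""
    return np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, "fro") ** 2

# Usage: sketch a 500 x 30 matrix with ell = 10 rows.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30))
B = frequent_directions(A, ell=10)
print(B.shape, covariance_error(A, B))
```

With this variant each shrink removes at least $(\ell+1)$ times the subtracted value from the squared Frobenius norm, so the covariance error stays below $1/\ell$, matching $\ell = 1/\varepsilon$ on the slide.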
Matrix Sketching over Sliding Windows
Each row is associated with a timestamp.
Maintain $B_W$ for $A_W$: the rows in sliding window $W$.
Covariance error: $\|A_W^T A_W - B_W^T B_W\| / \|A_W\|_F^2 \le \varepsilon$.
Sequence-based window: $A_W$ is the past $N$ rows.
Time-based window: $A_W$ is the rows in the past $\Delta$ time units.

For some reason (other than getting a paper published), we are going to study the matrix sketching problem over sliding windows.

Motivation 1: Sliding windows vs. unbounded streams
The sliding window model is a more appropriate model in many real-world applications.
This is particularly so in the areas of data analysis where matrix sketching techniques are widely used.
Applications:
Analyzing tweets for the past 24 hours.
Sliding window PCA for detecting changes and anomalies [Papadimitriou2006, Qahtan2015].

Matrix sketching has proven successful for speeding up SVD computation on batched data, but if you think about it, the online part actually makes less sense. The first reason comes from the long-running debate between the sliding window model and the unbounded streaming model.

Motivation 2: Lower bound
Unbounded stream solution: use $O(d^2)$ space to store $A^T A$.
Update: $A^T A \leftarrow A^T A + a_i^T a_i$.
Theorem 4.1: An algorithm that returns $A^T A$ for any sequence-based sliding window must use $\Omega(Nd)$ bits of space.
Matrix sketching is necessary for sliding windows, even when the dimension $d$ is small.
Matrix sketching over sliding windows requires new techniques.

The second motivation comes from a lower bound.

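
The contrast on this slide can be illustrated with a few lines of code. The class below is a hypothetical helper (`WindowedCovariance` is not from the paper): it applies the rank-one update from the slide when a row arrives, but to expire a row it must subtract that row's outer product, which means the row itself has to be kept. This only illustrates the intuition behind the $\Omega(Nd)$ bound; it is not a proof of Theorem 4.1.

```python
import numpy as np
from collections import deque

class WindowedCovariance:
    """Exact A_W^T A_W over the last N rows (sequence-based window)."""

    def __init__(self, N, d):
        self.N = N
        self.cov = np.zeros((d, d))  # the O(d^2) part
        self.window = deque()        # the O(N d) part: rows kept for expiry

    def update(self, a):
        a = np.asarray(a, dtype=float)
        self.cov += np.outer(a, a)          # A^T A <- A^T A + a_i^T a_i
        self.window.append(a)
        if len(self.window) > self.N:
            old = self.window.popleft()     # oldest row leaves the window
            self.cov -= np.outer(old, old)  # undo its contribution

# Usage: stream 50 rows through a window of N = 10.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
wc = WindowedCovariance(N=10, d=4)
for a in A:
    wc.update(a)
```

In the unbounded stream the `deque` is unnecessary and only `cov` is stored; in the sliding window the buffer of rows dominates the space.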
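Three algorithms
Sampling: Sample $a_i$ with probability proportional to $\|a_i\|^2$ [Frieze2004]. Priority sampling [Efraimidis2006] + sliding window top-k.
LM-FD: Exponential Histogram (Logarithmic method) [Datar2002] + Frequent Directions.
DI-FD: Dyadic interval techniques [Arasu2004] + Frequent Directions.

| Sketches | Update | Space | Window | Interpretable? |
|---|---|---|---|---|
| Sampling | $\frac{d}{\varepsilon^2} \log\log NR$ | $\frac{d}{\varepsilon^2} \log NR$ | Sequence & time | Yes |
| LM-FD | $d \log \varepsilon NR$ | $\frac{1}{\varepsilon^2} \log \varepsilon NR$ | Sequence & time | No |
| DI-FD | $\frac{d}{\varepsilon} \log \frac{R}{\varepsilon}$ | $\frac{R}{\varepsilon} \log \frac{R}{\varepsilon}$ | Sequence | No |
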
Three algorithms

| Sketches | Update | Space | Window | Interpretable? |
|---|---|---|---|---|
| Sampling | Slow | Large | Sequence & time | Yes |
| LM-FD | Fast | Small | Sequence & time | No |
| DI-FD | Best for small $R$ | Best for small $R$ | Sequence | No |

Interpretable: rows of the sketch $B$ come from $A$.
$R$: ratio between the maximum and minimum squared norms of the rows.

Experiments: space vs. error
[Figure: three panels, $R = 8.35$, $R = 1$, $R = 90089$]

These are the experiments on three datasets. The important parameter is $R$, the ratio between the maximum and minimum squared norms of the rows. The experimental results concur with our theoretical analysis.

Experiments: time vs. space
[Figure: three panels, $R = 8.35$, $R = 1$, $R = 90089$]

Conclusions
First attempt to tackle the sliding window matrix sketching problem.
Lower bounds show that the sliding window model is different from the unbounded streaming model for the matrix sketching problem.
We propose algorithms for both time-based and sequence-based windows, with theoretical guarantees and experimental evaluation.

Thanks!
Experiments
[Figure: three panels, $R = 8.35$, $R = 1$, $R = 90089$]
LM-FD provides better space-error tradeoffs than the sampling algorithms.
DI-FD vs. LM-FD: depends on the ratio $R$.
SWOR vs. SWR: depends on the data set.

Experiments
Run algorithms on real-world matrices.
Measure actual covariance error, space and update time.
Datasets for sequence-based windows:
SYNTHETIC: random noisy matrix, used by [Liberty2013].
BIBD: incidence matrix of a Balanced Incomplete Block Design, from Mark Giesbrecht, University of Waterloo.
PAMAP: physical activity monitoring data set.

Sampling based algorithms
Insight [Frieze2004]: Sample each row $a_i$ with probability proportional to its squared norm $\|a_i\|^2$ and rescale with proper factors.
Priority sampling [Efraimidis2006]:
"Magical" priority $u^{1/\|a_i\|^2}$, $u = \mathrm{rand}(0, 1)$.
Top-1 priority: a sample with probability proportional to $\|a_i\|^2$.
Sample with replacement (SWR): Run $\ell$ independent samplers.

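
The priority trick above can be sketched in plain Python. The names `priority_sample` and `swr_sketch` are hypothetical, and the rescaling factor $\sqrt{\|A\|_F^2 / (\ell \, \|a_i\|^2)}$ is the standard choice for norm-proportional sampling, assumed here because the slide only says "proper factors". This batch version ignores the sliding window.

```python
import math
import random

def priority_sample(rows, rng):
    """Return the index of the row with the largest priority u**(1/||a||^2).
    That row is a sample with probability proportional to ||a||^2.
    Assumes at least one row is nonzero."""
    best_idx, best_priority = None, -1.0
    for i, a in enumerate(rows):
        w = sum(x * x for x in a)          # squared norm of the row
        if w == 0.0:
            continue                       # zero rows are never sampled
        priority = rng.random() ** (1.0 / w)
        if priority > best_priority:
            best_idx, best_priority = i, priority
    return best_idx

def swr_sketch(rows, ell, rng):
    """Sample with replacement: ell independent samplers, rows rescaled
    so that E[B^T B] = A^T A (assumed standard rescaling)."""
    frob2 = sum(x * x for a in rows for x in a)
    sketch = []
    for _ in range(ell):
        i = priority_sample(rows, rng)
        w = sum(x * x for x in rows[i])
        scale = math.sqrt(frob2 / (ell * w))
        sketch.append([scale * x for x in rows[i]])
    return sketch

# Usage: two rows with squared norms 1 and 3.
rng = random.Random(0)
rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
B = swr_sketch(rows, ell=4, rng=rng)
```

Every row of `B` is a rescaled row of the input, which is what the comparison table means by an interpretable sketch.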
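Maintaining top-1 priority over sliding window
[Figure: rows along the time axis; for $R = 1$, the probabilities of being in the skyline are $1, \frac{1}{2}, \frac{1}{3}, \dots, \frac{1}{N}$]
For $R = 1$: expected number of skyline points: $\log N$.
Number of skyline points: $O(\log NR)$.
Space for sampling $\ell$ rows: $O(\ell \log NR)$.
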
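
A row can only ever become the top-1 priority in some future window if no later row has a higher priority; the rows with this property form the skyline. A minimal sketch of one sampler for a sequence-based window follows, assuming a monotonic deque as the data structure. `SlidingTopPriority` and `row_priority` are hypothetical names, not taken from the paper.

```python
import random
from collections import deque

def row_priority(a, rng):
    """Priority u**(1/||a||^2) for a nonzero row a."""
    w = sum(x * x for x in a)
    return rng.random() ** (1.0 / w)

class SlidingTopPriority:
    """Tracks the highest-priority item among the last N updates."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        # (timestamp, priority, item); priorities strictly decrease
        # from the oldest entry (left) to the newest entry (right).
        self.skyline = deque()

    def update(self, priority, item):
        self.t += 1
        # Older entries with priority <= the new one can never win again.
        while self.skyline and self.skyline[-1][1] <= priority:
            self.skyline.pop()
        self.skyline.append((self.t, priority, item))
        # Drop entries that have slid out of the window.
        while self.skyline[0][0] <= self.t - self.N:
            self.skyline.popleft()

    def sample(self):
        """Item with the top-1 priority in the current window."""
        return self.skyline[0][2]

# Usage: feed rows of equal norm (R = 1) through a window of N = 100.
rng = random.Random(0)
sampler = SlidingTopPriority(N=100)
for i in range(1000):
    row = [1.0, 0.0]
    sampler.update(row_priority(row, rng), i)
```

Running $\ell$ independent copies of this sampler gives the sample-with-replacement sketch, with the space governed by the skyline size on the slide.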
Logarithmic Method: LM-FD algorithm
Works for time-based and sequence-based windows.
Combine FD with Exponential Histogram [Datar2002].
Mergeability: if $B_1 = \mathrm{FD}(A_1, \varepsilon)$ and $B_2 = \mathrm{FD}(A_2, \varepsilon)$, then $B = \mathrm{FD}([B_1; B_2], \varepsilon)$ is an FD sketch for $A = [A_1; A_2]$.
Merge all blocks to form $B$.

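
The mergeability property can be checked numerically. The sketch below uses a one-shot form of FD (a single SVD followed by one shrink step) as a stand-in for the streaming algorithm; `fd_shrink` and `fd_merge` are hypothetical names, the norm is assumed to be the spectral norm, and the exponential histogram bookkeeping of LM-FD is not shown.

```python
import numpy as np

def fd_shrink(M, ell):
    """One-shot FD: reduce M to a sketch with at most ell rows by
    subtracting the (ell+1)-th squared singular value from all of them."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_new = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    return (s_new[:, None] * Vt)[:ell]

def fd_merge(B1, B2, ell):
    """Merge two FD sketches: stack them and sketch the stack."""
    return fd_shrink(np.vstack([B1, B2]), ell)

# Usage: sketch two blocks separately, then merge.
rng = np.random.default_rng(0)
A1 = rng.standard_normal((200, 20))
A2 = rng.standard_normal((300, 20))
ell = 8
B = fd_merge(fd_shrink(A1, ell), fd_shrink(A2, ell), ell)

A = np.vstack([A1, A2])
err = np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, "fro") ** 2
print(B.shape, err)
```

The merged sketch never sees the original rows, yet its covariance error with respect to the stacked matrix stays within the same $1/\ell$ bound as a sketch built directly from it.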
Dyadic Interval: DI-FD algorithm
Works for sequence-based windows.
Combine FD with dyadic interval techniques [Arasu2004].
A window of size $N$ can be decomposed into $\log N$ dyadic intervals.
Sketches at different levels have different error parameters.
Combine $\log N$ sketches to form $B$.

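
To illustrate the decomposition step only, here is a minimal sketch that splits a range of row indices into aligned dyadic intervals. The function name `dyadic_decompose` is hypothetical, and this is the textbook greedy decomposition, which uses $O(\log N)$ pieces (at most about $2 \log_2 N$ for a window that is not aligned). It is not necessarily the exact scheme used by DI-FD, and the per-level FD sketches and error parameters are not shown.

```python
def dyadic_decompose(lo, hi):
    """Split [lo, hi) into maximal aligned dyadic intervals
    [k * 2^j, (k + 1) * 2^j)."""
    pieces = []
    while lo < hi:
        # Largest power of two that divides lo (any size fits when lo == 0).
        size = lo & -lo if lo > 0 else 1 << (hi - lo).bit_length()
        # Shrink until the block fits in what remains of the window.
        while size > hi - lo:
            size >>= 1
        pieces.append((lo, lo + size))
        lo += size
    return pieces

# Usage: a window of N = 8 rows covering indices 5..12.
print(dyadic_decompose(5, 13))  # [(5, 6), (6, 8), (8, 12), (12, 13)]
```

If one sketch is kept per dyadic interval, the sketch for the window is obtained by merging the sketches of the returned pieces.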
Low rank approximation
$A \approx U_k \Sigma_k V_k^T$, where $U_k \in \mathbb{R}^{n \times k}$, $\Sigma_k = \mathrm{diag}(\delta_1, \delta_2, \dots, \delta_k)$ and $V_k \in \mathbb{R}^{d \times k}$.
Principal component analysis (PCA)
k-means clustering
Latent Semantic Indexing (LSI)

SVD is useful for computing a low rank approximation, where you keep the largest $k$ singular values and set the others to 0. By doing so, the rank of the matrix is reduced to $k$.

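
The truncation described in the notes takes only a few lines with NumPy. `low_rank_approx` is a hypothetical helper name; the code keeps the $k$ largest singular values and drops the rest.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of A: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Usage: reduce a 50 x 8 matrix to rank 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
A_k = low_rank_approx(A, 3)
print(np.linalg.matrix_rank(A_k))
```

The spectral-norm error of this approximation equals the first singular value that was dropped, which is the best any rank-$k$ matrix can achieve.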