Matrix Sketching over Sliding Windows


Matrix Sketching over Sliding Windows
Zhewei Wei¹, Xuancheng Liu¹, Feifei Li², Shuo Shang¹, Xiaoyong Du¹, Ji-Rong Wen¹
¹ School of Information, Renmin University of China
² School of Computing, The University of Utah

I assume that you are not familiar with the notion of matrix sketching, and given the time limit, this talk will focus on problem definitions and motivations; all technicalities will be ignored.

Matrix data

Modern data sets are modeled as large matrices. Think of A ∈ R^{n×d} as n rows in R^d.

Data | Rows | Columns | d | n
Textual | Documents | Words | 10^5 – 10^7 | > 10^10
Actions | Users | Types | 10^1 – 10^4 | > 10^7
Visual | Images | Pixels, SIFT | 10^5 – 10^6 | > 10^8
Audio | Songs, tracks | Frequencies | |
Machine Learning | Examples | Features | 10^2 – 10^4 | > 10^6
Financial | Prices | Items, Stocks | 10^3 – 10^5 |

Let's talk about matrix data. Many modern data sets are now modeled as large matrices. We think of these matrices as n rows in d dimensions. Take text data for example: each row of the matrix A represents a document, and each column represents a word. In the bag-of-words model, the entry (i, j) is 1 if and only if document i contains word j. In these matrices the dimension d is large, and the row number n is even larger. So this is sort of a skinny matrix.

Singular Value Decomposition (SVD)

A = U Σ V^T, where U ∈ R^{n×n} and V ∈ R^{d×d} are orthogonal and Σ = diag(δ_1, δ_2, …, δ_d) holds the singular values.

- Principal component analysis (PCA)
- k-means clustering
- Latent semantic indexing (LSI)

One of the most commonly used techniques for analyzing such matrices is the singular value decomposition.
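
As a quick aside (mine, not from the talk), the decomposition is a one-liner with NumPy; a minimal sketch using a small random matrix as stand-in data:

```python
import numpy as np

# A small random matrix standing in for the n x d data matrix (hypothetical data).
A = np.random.randn(1000, 20)

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factors reconstruct A exactly (up to floating-point error).
assert np.allclose(A, U @ np.diag(s) @ Vt)
```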

SVD & Eigenvalue decomposition 𝐴 𝐴 𝑇 𝑎 11 … 𝑎 𝑛1 𝑎 11 … 𝑎 1𝑑 Covariance Matrix 𝐴 𝑇 𝐴 ⋮ … ⋮ × 𝑎 1𝑑 … 𝑎 𝑛𝑑 ⋮ … ⋮ 𝑎 𝑛1 … 𝑎 𝑛𝑑 𝑉 Σ 2 𝑉 𝑇 To compute the svd, a typical way is to compute A transpose times A, which is a d by d square matrix, and use power iteration techniques. Essentially, this is because the eigenvalue decomposition of A transpose A is V times sigma square times V transpose. So if we want to maintain the svd or low rank approximation of A, we only need to maintain A transpose A. 𝑣 11 … 𝑣 1𝑑 … 𝑣 11 … 𝑣 𝑑1 𝛿 1 2 𝛿 2 2 ⋮ … = ⋮ ⋮ … ⋮ × × ⋱ ⋮ ⋮ 𝑣 𝑑1 … 𝑣 𝑛𝑑 … 𝛿 𝑑 2 𝑣 1𝑑 … 𝑣 𝑛𝑑

Matrix Sketching

Computing the SVD is slow (and offline).

Matrix sketching: approximate a large matrix A ∈ R^{n×d} with B ∈ R^{l×d}, l ≪ n, in an online fashion.
- Row-update stream: each update receives a row.
- Covariance error [Liberty2013, Ghashami2014, Woodruff2016]: ||A^T A − B^T B|| / ||A||_F^2 ≤ ε.
- Other sketches: feature hashing [Weinberger2009], random projection [Papadimitriou2011], …
- Frequent Directions (FD) [Liberty2013]: B ∈ R^{l×d} with l = 1/ε, such that the covariance error is ≤ ε.

Now let's talk about matrix sketching. The motivation of matrix sketching is that computing the SVD is slow, and offline. By online, we mean that matrix sketching works in row-update streams; by approximation, we mean that the covariance error between A and B must be small.
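
Since Frequent Directions is the building block for everything that follows, here is a minimal, simplified rendering in the spirit of [Liberty2013] (my own sketch, not the authors' code): keep l rows, and whenever the buffer fills, shrink every direction by the median squared singular value.

```python
import numpy as np

def frequent_directions(rows, l):
    """Maintain an l x d sketch B of the rows seen so far (simplified FD)."""
    d = rows.shape[1]
    B = np.zeros((l, d))
    for a in rows:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # Buffer full: SVD of B, subtract the median squared singular value
            # from every squared singular value; at least half the rows zero out.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[l // 2] ** 2
            s_shrunk = np.sqrt(np.maximum(s**2 - delta, 0.0))
            B = np.diag(s_shrunk) @ Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = a            # place the new row in a free slot
    return B

# Usage: the covariance error should be bounded by roughly 1/l (up to a small constant).
A = np.random.randn(2000, 30)
B = frequent_directions(A, l=10)
err = np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, 'fro')**2
print(err)
```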

Matrix Sketching over Sliding Windows

Each row is associated with a timestamp. Maintain a sketch B_W for A_W, the rows in the sliding window W.
- Covariance error: ||A_W^T A_W − B_W^T B_W|| / ||A_W||_F^2 ≤ ε.
- Sequence-based window: A_W is the past N rows.
- Time-based window: A_W is the rows from the past time period Δ.

For some reason (other than getting a paper published), we are going to study the matrix sketching problem over sliding windows.
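
For concreteness, here is a brute-force baseline (my own illustration, not part of the slides) that stores the window explicitly and recomputes A_W^T A_W on demand; the sketching algorithms in the talk aim to avoid exactly this kind of full-window storage:

```python
from collections import deque
import numpy as np

class ExactWindowCovariance:
    """Brute-force baseline: store all rows of the window explicitly."""
    def __init__(self, N=None, delta=None):
        self.N, self.delta = N, delta          # sequence-based or time-based window
        self.rows = deque()                    # (timestamp, row) pairs

    def update(self, t, a):
        self.rows.append((t, np.asarray(a, dtype=float)))
        if self.N is not None:                 # sequence-based: keep the past N rows
            while len(self.rows) > self.N:
                self.rows.popleft()
        if self.delta is not None:             # time-based: keep the past delta time units
            while self.rows and self.rows[0][0] <= t - self.delta:
                self.rows.popleft()

    def covariance(self):
        A_W = np.vstack([r for _, r in self.rows])
        return A_W.T @ A_W                     # exact A_W^T A_W

win = ExactWindowCovariance(N=100)             # sequence-based window of the past 100 rows
for t, a in enumerate(np.random.randn(500, 8)):
    win.update(t, a)
cov = win.covariance()
```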

Motivation 1: Sliding windows vs. unbounded streams

The sliding window model is a more appropriate model in many real-world applications, particularly in the areas of data analysis where matrix sketching techniques are widely used.

Applications:
- Analyzing tweets from the past 24 hours.
- Sliding window PCA for detecting changes and anomalies [Papadimitriou2006, Qahtan2015].

Matrix sketching has proven successful for speeding up SVD computation on batched data, but if you think about it, the online part actually makes less sense. The first reason comes from the long debate between the sliding window model and the unbounded streaming model.

Motivation 2: Lower bound

Unbounded stream solution: use O(d^2) space to store A^T A, with the update A^T A ← A^T A + a_i^T a_i.

Theorem 4.1. An algorithm that returns A^T A for any sequence-based sliding window must use Ω(Nd) bits of space.

- Matrix sketching is necessary for sliding windows, even when the dimension d is small.
- Matrix sketching over sliding windows requires new techniques.

Another motivation comes from a lower bound.
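
The unbounded-stream baseline above is a one-line rank-1 update; a minimal sketch (mine) with stand-in data:

```python
import numpy as np

d = 20
cov = np.zeros((d, d))                 # running A^T A, O(d^2) space

stream = np.random.randn(1000, d)      # stand-in for an unbounded row stream
for a in stream:
    cov += np.outer(a, a)              # rank-1 update: A^T A <- A^T A + a^T a

assert np.allclose(cov, stream.T @ stream)
```

The catch is that this update cannot be undone when a row expires from a sliding window unless the expired rows are stored, which is what the Ω(Nd) lower bound formalizes.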

Three algorithms

- Sampling: sample a_i with probability proportional to ||a_i||^2 [Frieze2004]; priority sampling [Efraimidis2006] + sliding window top-k.
- LM-FD: Exponential Histogram (logarithmic method) [Datar2002] + Frequent Directions.
- DI-FD: dyadic interval techniques [Arasu2004] + Frequent Directions.

Sketches | Update | Space | Window | Interpretable?
Sampling | (d/ε^2) log log(NR) | (d/ε^2) log(NR) | Sequence & time | Yes
LM-FD | d log(εNR) | (1/ε^2) log(εNR) | Sequence & time | No
DI-FD | (d/ε) log(R/ε) | (R/ε) log(R/ε) | Sequence |

Three algorithms

- Sampling: sample a_i with probability proportional to ||a_i||^2 [Frieze2004]; priority sampling [Efraimidis2006] + sliding window top-k.
- LM-FD: Exponential Histogram (logarithmic method) [Datar2002] + Frequent Directions.
- DI-FD: dyadic interval techniques [Arasu2004] + Frequent Directions.

Sketches | Update | Space | Window | Interpretable?
Sampling | Slow | Large | Sequence & time | Yes
LM-FD | Fast | Small | Sequence & time | No
DI-FD | Best for small R | | Sequence |

Interpretable: rows of the sketch B come from A.
R: ratio between the maximum and minimum squared row norms.

Experiments: space vs. error

[Three plots of covariance error vs. space, one per dataset: R = 8.35, R = 1, and R = 90089.]

These are the experiments on three datasets. The important parameter is R, the ratio between the maximum and minimum squared row norms. The experimental results concur with our theoretical analysis.

Experiments: time vs. space

[Three plots of update time vs. space on the same datasets: R = 8.35, R = 1, and R = 90089.]

Conclusions

- First attempt to tackle the sliding window matrix sketching problem.
- Lower bounds show that, for matrix sketching, the sliding window model differs from the unbounded streaming model.
- Proposed algorithms for both time-based and sequence-based windows, with theoretical guarantees and experimental evaluation.

Thanks!

Experiments

[Three plots, one per dataset: R = 8.35, R = 1, and R = 90089.]

- LM-FD provides better space-error tradeoffs than the sampling algorithms.
- DI-FD vs. LM-FD: depends on the ratio R.
- SWOR vs. SWR: depends on the data set.

Experiments

Run the algorithms on real-world matrices; measure actual covariance error, space, and update time.

Datasets for sequence-based windows:
- SYNTHETIC: random noisy matrix, used by [Liberty2013].
- BIBD: incidence matrix of a Balanced Incomplete Block Design, from Mark Giesbrecht, University of Waterloo.
- PAMAP: physical activity monitoring data set.

Sampling-based algorithms

Insight [Frieze2004]: sample each row a_i with probability proportional to its squared norm ||a_i||^2 and rescale with proper factors.

Priority sampling [Efraimidis2006]:
- "Magical" priority u^(1/||a_i||^2), with u = rand(0, 1).
- The row with the top-1 priority is a sample drawn with probability proportional to ||a_i||^2.
- Sample with replacement (SWR): run l independent samplers.
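
A small illustration of the priority trick (my own sketch, following [Efraimidis2006] rather than the paper's code): assign each row the priority u^(1/||a_i||^2) and keep the row with the largest priority; that row is a sample with probability proportional to its squared norm.

```python
import numpy as np

def weighted_sample_one(rows, rng):
    """Index of one row chosen w.p. proportional to ||a_i||^2, by keeping the
    row whose priority u ** (1 / ||a_i||^2) is largest (u uniform in [0, 1))."""
    best_idx, best_priority = -1, -1.0
    for i, a in enumerate(rows):
        w = float(np.dot(a, a))                  # weight = squared norm
        priority = rng.random() ** (1.0 / w)     # Efraimidis-Spirakis key
        if priority > best_priority:
            best_idx, best_priority = i, priority
    return best_idx

# Sample with replacement (SWR): run l independent copies of the sampler.
rng = np.random.default_rng(0)
rows = rng.standard_normal((1000, 5))
sample_idx = [weighted_sample_one(rows, rng) for _ in range(10)]
```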

Maintaining the top-1 priority over a sliding window

[Figure: priorities of the rows over time, with in-skyline probabilities 1, 1/2, 1/3, …, 1/N.]

For R = 1: the probability that each row is in the skyline is shown above, and the expected number of skyline points is log N.
- In general, the number of skyline points is O(log NR).
- Space for sampling l rows: O(l log NR).
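
One way to maintain the top-1 priority over a sequence-based window (my own sketch of the standard skyline / monotonic-queue idea; the paper's structure may differ in details): keep only the rows whose priority exceeds that of every later row, evict expired rows from the front, and read the maximum off the front.

```python
from collections import deque
import random

def sliding_window_max_priority(stream, N):
    """For each position, report the max priority among the last N items.
    `stream` yields (timestamp, priority) pairs with increasing timestamps."""
    skyline = deque()   # (timestamp, priority) pairs, priorities strictly decreasing
    for t, p in stream:
        # The new item dominates every older item with a smaller priority.
        while skyline and skyline[-1][1] <= p:
            skyline.pop()
        skyline.append((t, p))
        # Drop items that have expired from the window.
        while skyline and skyline[0][0] <= t - N:
            skyline.popleft()
        yield t, skyline[0][1]   # current top-1 priority in the window

stream = [(t, random.random()) for t in range(1000)]
maxima = list(sliding_window_max_priority(stream, N=100))
```

With random priorities this structure holds about log N entries in expectation, matching the log N figure on the slide for R = 1.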

Logarithmic Method: LM-FD algorithm

- Works for both time-based and sequence-based windows.
- Combines FD with the Exponential Histogram [Datar2002].
- Mergeability: if B_1 = FD(A_1, ε) and B_2 = FD(A_2, ε), then B = FD([B_1; B_2], ε) is an FD sketch for A = [A_1; A_2].
- Merge all blocks to form B.
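
The mergeability property is easy to exercise in code. Below is a self-contained sketch (reusing the simplified FD routine from earlier, here called fd) that sketches two blocks separately and then re-sketches their stacked rows:

```python
import numpy as np

def fd(rows, l):
    """Simplified Frequent Directions: return an l x d sketch of `rows`."""
    B = np.zeros((l, rows.shape[1]))
    for a in rows:
        zero = np.where(~B.any(axis=1))[0]
        if len(zero) == 0:
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s**2 - s[l // 2]**2, 0.0))
            B = np.diag(s) @ Vt
            zero = np.where(~B.any(axis=1))[0]
        B[zero[0]] = a
    return B

# Mergeability: sketch two blocks separately, then sketch the stacked sketches.
A1, A2 = np.random.randn(800, 25), np.random.randn(600, 25)
B1, B2 = fd(A1, 10), fd(A2, 10)
B = fd(np.vstack([B1, B2]), 10)          # an FD sketch for A = [A1; A2]

A = np.vstack([A1, A2])
err = np.linalg.norm(A.T @ A - B.T @ B, 2) / np.linalg.norm(A, 'fro')**2
print(err)   # small covariance error, as mergeability promises
```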

Dyadic Interval: DI-FD algorithm

- Works for sequence-based windows.
- Combines FD with dyadic interval techniques [Arasu2004].
- A window of size N can be decomposed into log N dyadic intervals.
- Sketches at different levels have different error parameters.
- Combine the log N sketches to form B.
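
To illustrate the decomposition step (my own sketch; the paper's bookkeeping with per-level error parameters is more involved), any index range can be greedily covered by O(log N) dyadic intervals, i.e. intervals of the form [k·2^j, (k+1)·2^j):

```python
def dyadic_cover(lo, hi):
    """Cover the half-open integer range [lo, hi) with O(log(hi - lo)) dyadic
    intervals of the form [k * 2**j, (k + 1) * 2**j)."""
    cover = []
    while lo < hi:
        # Largest power of two aligned at lo (lowbit); unbounded if lo == 0.
        size = lo & -lo if lo > 0 else 1 << (hi - lo).bit_length()
        # Shrink until the interval fits inside [lo, hi).
        while size > hi - lo:
            size //= 2
        cover.append((lo, lo + size))
        lo += size
    return cover

# Example: the window covering positions [37, 100) of the stream.
print(dyadic_cover(37, 100))
# [(37, 38), (38, 40), (40, 48), (48, 64), (64, 96), (96, 100)]
```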

Low rank approximation

A ≈ U_k Σ_k V_k^T: keep the largest k singular values δ_1, …, δ_k and set the rest to 0.

- Principal component analysis (PCA)
- k-means clustering
- Latent Semantic Indexing (LSI)

SVD is useful for computing a low rank approximation, where you take the largest k singular values and make the others 0. By doing so, the rank of the matrix is reduced to k.
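
A minimal NumPy illustration (mine, not from the slides) of the rank-k truncation described above:

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of A via truncated SVD (keep the k largest singular values)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.randn(200, 50)
A_k = low_rank_approx(A, k=5)
print(np.linalg.matrix_rank(A_k))   # 5
```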