1
Iterative Row Sampling. Richard Peng (CMU / MIT). Joint work with Mu Li (CMU) and Gary Miller (CMU).
2
OUTLINE
- Matrix sketches: existence
- Samples → better samples
- Iterative algorithms
3
DATA: n-by-d matrix A with m entries. Columns: data points; rows: attributes. Goal: classification/clustering, identifying patterns, interpreting new data.
4
LINEAR MODEL: can add/scale data points. x: coefficients; combination: Ax = x_1 A_{:,1} + x_2 A_{:,2} + x_3 A_{:,3}.
5
PROBLEM: interpret a new data point b as a combination of known ones: Ax ≈ b?
6
REGRESSION: express b as a combination of current examples. Regression: min_x ||Ax − b||_p. p = 2: least squares; p = 1: compressive sensing. ||x||_2: Euclidean norm of x; ||x||_1: sum of absolute values.
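For the p = 2 case, a minimal numpy sketch (illustrative only; the toy data here is random, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))   # known examples: columns combined by Ax
b = rng.standard_normal(n)        # new data point to express as Ax

# p = 2 (least squares): min_x ||Ax - b||_2 via numpy's lstsq
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print("residual norm:", np.linalg.norm(A @ x - b))
```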
7
VARIANTS OF COMPRESSIVE SENSING
- min_x ||Ax − b||_1 + ||x||_1
- min_x ||Ax − b||_2 + ||x||_1
- min_x ||x||_1 s.t. Ax = b
- min_x ||Ax||_1 s.t. Bx = y
- min_x ||Ax − b||_1 + ||Bx − y||_1
All similar to min_x ||Ax − b||_1.
8
SIMPLIFIED: min_x ||Ax − b||_p = min_x ||[A, b][x; −1]||_p, so regression is equivalent to min ||Ax||_p with one entry of x fixed.
9
'BIG' DATA POINTS: each data point has many attributes, so #rows (n) >> #columns (d). Examples: genetic data, time series (videos). The reverse (d >> n) is also common: images + SIFT.
10
FASTER? Replace A with a smaller, equivalent A': a matrix sketch.
11
ROW SAMPLING: pick some rows of A to be A'. How to pick? Randomly.
12
SHORTER EQUIVALENT: find a shorter A' that preserves the answer: ||Ax||_p ≈_{1+ε} ||A'x||_p for all x. Run the algorithm on A'; an answer good for A' is good for A. Simplified error notation: a ≈_k b if there exist k_1, k_2 such that k_2/k_1 ≤ k and k_1 a ≤ b ≤ k_2 a.
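To make the ≈_k guarantee concrete, a hedged numpy check (check_sketch is a hypothetical helper, not from the paper); the spread of the ratio over random directions lower-bounds the true distortion:

```python
import numpy as np

def check_sketch(A, A_prime, trials=1000, seed=0):
    """Probe ||Ax||_2 / ||A'x||_2 over random x; the observed spread
    lower-bounds the k for which ||Ax||_2 =~_k ||A'x||_2 holds."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        x = rng.standard_normal(A.shape[1])
        ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(A_prime @ x))
    return min(ratios), max(ratios)
```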
13
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
14
SKETCHES EXIST. Linear sketches: A' = SA. [Drineas et al. `12]: row sampling, one non-zero in each row of S. [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column. Guarantee: ||Ax||_p ≈ ||A'x||_p for all x.
15
SKETCHES EXIST (sketch sizes, number of rows):

| | p = 2 | p = 1 |
| --- | --- | --- |
| Dasgupta et al. `09 | | d^2.5 |
| Magdon-Ismail `10 | d log^2 d | |
| Sohler & Woodruff `11 | | d^3.5 |
| Drineas et al. `12 | d log d | |
| Clarkson et al. `12 | | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | d^2 log d | d^8 |
| Mahoney & Meng `12 | d^2 | d^3.5 |
| Nelson & Nguyen `12 | d^{1+α} | |
| This paper | d log d | d^3.66 |

Hidden: runtime costs, ε^{-2} dependency.
16
WHY IS ≈ d POSSIBLE? ||Ax||_2^2 = x^T A^T A x, and A^T A is a d-by-d matrix. Any factorization (e.g., QR) of A^T A suffices as A': ||Ax||_2 = ||A'x||_2 for all x.
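For instance, a minimal numpy sketch (not from the slides): the R factor of a QR factorization satisfies R^T R = A^T A, so R is an exact d-row equivalent of A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))

# R^T R = A^T A, so ||Ax||_2 = ||Rx||_2 exactly: a d-row "sketch" of A.
R = np.linalg.qr(A, mode="r")
x = rng.standard_normal(10)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # equal up to rounding
```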
17
A^T A: THE COVARIANCE MATRIX. Dot products of all pairs of columns (data points). Covariance: cov(j_1, j_2) = Σ_i A_{i,j_1} A_{i,j_2}.
18
USE OF COVARIANCE MATRIX C = A^T A. Clustering: ℓ_2 distances of all pairs are given by C. Kernel methods: all-pair dot products suffice for many models.
19
OTHER USE OF COVARIANCE: covariance of attributes used to tune parameters. Images + SIFT: many data points, few attributes. http://www.image-net.org/: 14,197,122 images, 1000 SIFT features.
20
HOW EXPENSIVE IS THIS? d^2 dot products of length-n vectors. Total: O(nd^2); faster: O(nd^{ω−1}). Still expensive: nd^2 > nd > m.
21
EQUIVALENT VIEW OF SKETCHES: approximate covariance matrix C' = (A')^T A'. ||Ax||_2 ≈ ||A'x||_2 is the same as C ≈ C'.
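For p = 2 this equivalence can be checked exactly: the best k with C ≈_k C' is a ratio of extreme generalized eigenvalues of (C, C'). A hedged numpy sketch (spectral_distortion is a hypothetical helper; assumes A' has full column rank):

```python
import numpy as np

def spectral_distortion(A, A_prime):
    """Smallest k with C =~_k C', where C = A^T A and C' = A'^T A':
    the ratio of extreme eigenvalues of L^{-1} C L^{-T}, with C' = L L^T."""
    C = A.T @ A
    L = np.linalg.cholesky(A_prime.T @ A_prime)
    L_inv = np.linalg.inv(L)
    w = np.linalg.eigvalsh(L_inv @ C @ L_inv.T)
    return w.max() / w.min()
```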
22
APPLICATION OF SKETCHES: with A' of n' rows, the d^2 dot products are over length-n' vectors. Total cost: O(n'd^{ω−1}).
23
SKETCHES IN INPUT SPARSITY TIME. Need: cost of computing C' < cost of computing C = A^T A. Two goals: n' small, and A' found efficiently.
24
COST AND QUALITY OF A':

| | p = 2 cost | p = 2 size | p = 1 cost | p = 1 size |
| --- | --- | --- | --- | --- |
| Dasgupta et al. `09 | | | nd^5 | d^2.5 |
| Magdon-Ismail `10 | nd^2/log d | d log^2 d | | |
| Sohler & Woodruff `11 | | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | d log d | | |
| Clarkson et al. `12 | | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m | d^2 log d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m | d^2 | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m | d^{1+α} | same as above | same as above |
| This paper | m + d^{ω+α} | d log d | m + d^{ω+α} | d^3.66 |
25
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
26
PREVIOUS APPROACHES: go to poly(d) rows directly, using projection to obtain key information, or the sketch itself. (A: m rows → A': poly(d) rows; "a miracle happens.")
27
OUR MAIN APPROACH: utilize the robustness of sketches, covariance matrices, and sampling; iteratively reduce errors and sizes (A → A'' → A').
28
BETTER ALGORITHM FOR P = 2:

| | p = 2 cost | p = 2 size | p = 1 cost | p = 1 size |
| --- | --- | --- | --- | --- |
| Dasgupta et al. `09 | | | nd^5 | d^2.5 |
| Magdon-Ismail `10 | nd^2/log d | d log^2 d | | |
| Sohler & Woodruff `11 | | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | d log d | | |
| Clarkson et al. `12 | | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m | d^2 log d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m | d^2 | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m | d^{1+α} | same as above | same as above |
| This paper | m + d^{ω+α} | d log d | m + d^{ω+α} | d^3.66 |
29
COMPOSING SKETCHES: A (n rows) → A'' (n' = d^{1+α} rows, in O(m)) → A' (d log d rows, in O(n'd log d + d^ω)). Total cost: O(m + n'd log d + d^ω) = O(m + d^ω).
30
ACCUMULATION OF ERRORS: ||Ax||_2 ≈_k ||A''x||_2 (n rows → n' = d^{1+α} rows) and ||A''x||_2 ≈_{k'} ||A'x||_2 (→ d log d rows) give ||Ax||_2 ≈_{kk'} ||A'x||_2.
31
ACCUMULATION OF ERRORS: ||Ax||_2 ≈_{kk'} ||A'x||_2, so the final error is the product of both errors. Dependency of error in cost: usually ε^{-2} or more for 1±ε error. [Avron & Toledo `11]: only the final step needs to be accurate. Idea: compute sketches indirectly.
32
ROW SAMPLING: pick some rows of A to be A'. How to pick? Randomly.
33
ARE ALL ROWS EQUAL? No: if A has a column with a single non-zero entry (one non-zero row), then ||A[1; 0; …; 0]||_p ≠ 0 depends entirely on that row, so uniform sampling can miss it.
34
ROW SAMPLING: τ': weights on the rows, normalized into a distribution. Pick a number of rows independently from this distribution and rescale them to form A'.
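A minimal numpy sketch of this sampling step (row_sample is a hypothetical helper; assumes the weights τ' are given as an array tau):

```python
import numpy as np

def row_sample(A, tau, num_rows, seed=0):
    """Draw num_rows rows i.i.d. with probability p_i = tau_i / sum(tau),
    rescaling row i by 1/sqrt(num_rows * p_i) so E[A'^T A'] = A^T A."""
    rng = np.random.default_rng(seed)
    p = tau / tau.sum()
    idx = rng.choice(A.shape[0], size=num_rows, p=p)
    return A[idx] / np.sqrt(num_rows * p[idx])[:, None]
```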
35
MATRIX CHERNOFF BOUNDS give a sufficient property of τ'. Let τ be the statistical leverage scores. If τ' ≥ τ entrywise, then ||τ'||_1 log d (scaled) rows suffice for A' ≈ A.
36
STATISTICAL LEVERAGE SCORES: studied in statistics since the 70s as a measure of the importance of rows. Leverage score of row i, A_i: τ_i = A_i (A^T A)^{-1} A_i^T. Key fact: ||τ||_1 = rank ≤ d, so ||τ'||_1 log d = d log d rows.
37
COMPUTING LEVERAGE SCORES: τ_i = A_i (A^T A)^{-1} A_i^T = A_i C^{-1} A_i^T, where C = A^T A is the covariance matrix. Given C^{-1}, each τ_i can be computed in O(d^2) time. Total cost: O(nd^2 + d^ω).
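A direct numpy sketch of this computation (leverage_scores is a hypothetical helper; assumes A has full column rank):

```python
import numpy as np

def leverage_scores(A):
    """tau_i = A_i C^{-1} A_i^T with C = A^T A (full column rank assumed)."""
    C_inv = np.linalg.inv(A.T @ A)
    # Row-wise quadratic forms: sum_jk A_ij * C_inv_jk * A_ik.
    return np.einsum("ij,jk,ik->i", A, C_inv, A)

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 8))
print(leverage_scores(A).sum())  # = rank(A) = 8, matching ||tau||_1 = rank
```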
38
COMPUTING LEVERAGE SCORES: τ_i = A_i C^{-1} A_i^T = ||A_i C^{-1/2}||_2^2, the squared 2-norm of the vector A_i C^{-1/2}. Multiplying by C^{-1/2} puts the rows in isotropic position: it decorrelates the columns.
39
ASIDE: WHAT IS LEVERAGE? Geometric view: rows define 'energy' directions; normalize so the total energy is uniform. τ_i: norm of row i after normalizing (A_i → A_i C^{-1/2}).
40
ASIDE: WHAT IS LEVERAGE? How to interpret statistical leverage scores? In statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]): influence on the data set, likelihood of being an outlier, uniqueness of a row.
41
ASIDE: WHAT IS LEVERAGE? A high leverage score: key attribute, or outlier (measurement error)?
42
ASIDE: WHAT IS LEVERAGE? My current view (motivated by graph sparsification): they are sampling probabilities; use them to find sketches.
43
COMPUTING LEVERAGE SCORES: τ_i = ||A_i C^{-1/2}||_2^2. We only need τ' ≥ τ, so we can use approximations after scaling them up; the error only leads to a larger ||τ'||_1.
44
DIMENSIONALITY REDUCTION: Johnson-Lindenstrauss transform: ||x||_2^2 ≈_{jl} ||xG||_2^2, with G a d-by-O(1/α) Gaussian and error jl = d^α.
45
ESTIMATING LEVERAGE SCORES: τ_i = ||A_i C^{-1/2}||_2^2 ≈_{jl} ||A_i C^{-1/2}G||_2^2, with G a d-by-O(1/α) Gaussian, so C^{-1/2}G is d-by-O(1/α). Cost: O(α·nnz(A_i)) per row; total O(α·m + α·d^2 log d).
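A hedged numpy sketch of this estimator (approx_leverage_scores is a hypothetical helper; it uses a Cholesky factor in place of C^{1/2} and a fixed sketch width k, both assumptions):

```python
import numpy as np

def approx_leverage_scores(A, k=20, seed=0):
    """Estimate tau_i = ||A_i C^{-1/2}||_2^2 by ||A_i (C^{-1/2} G)||_2^2
    with G a d-by-k scaled Gaussian; the d-by-k sketch is formed once."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    L = np.linalg.cholesky(A.T @ A)           # C = L L^T
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    sketch = np.linalg.inv(L).T @ G           # acts as C^{-1/2} G (up to rotation)
    return np.sum((A @ sketch) ** 2, axis=1)
```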
46
ESTIMATING LEVERAGE SCORES: C ≈_k C' gives ||C^{-1/2}x||_2 ≈_k ||C'^{-1/2}x||_2, i.e., C' can be used as a preconditioner for C: τ_i = ||A_i C^{-1/2}||_2^2 ≈ ||A_i C'^{-1/2}||_2^2. This can also be combined with JL.
47
ESTIMATING LEVERAGE SCORES: τ'_i = ||A_i C'^{-1/2}G||_2^2 ≈_{jl} ||A_i C'^{-1/2}||_2^2 ≈_{k} ||A_i C^{-1/2}||_2^2 = τ_i, so (jl·k)·τ' ≥ τ. Total number of rows: ||jl·k·τ'||_1 ≤ jl·k·||τ'||_1 ≤ k·d^{1+α}.
48
ESTIMATING LEVERAGE SCORES: the quality of A' does not depend on the quality of τ'. C ≈_k C' gives A' ≈_2 A with O(kd^{1+α}) rows in O(m + d^ω) time, since (jl·k)·τ' ≥ τ and ||jl·k·τ'||_1 ≤ jl·k·d^{1+α}. Some fixable issues when n >>> d.
49
SIZE REDUCTION: A'' ≈_{O(1)} A → C'' ≈_{O(1)} C → τ' ≈_{O(1)} τ → A' ≈_{O(1)} A with O(d^{1+α} log d) rows.
50
HIGH ERROR SETTING: A'' ≈_k A → C'' ≈_k C → τ' ≈_k τ → A' ≈_{O(1)} A with O(kd^{1+α} log d) rows.
51
ACCURACY BOOSTING: can reduce any error k in O(m + kd^{ω+α}) time; all intermediate steps can have large (constant) error.
52
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
53
ONE STEP SKETCHING: obtain a sketch of size poly(d), then error-correct to O(d log d) rows in poly(d) time. (A: m rows → A': poly(d) rows → A'': d log d rows; "a miracle happens.")
54
WHAT WE WILL SHOW: a number of iterative steps can give a similar result; more work, less miraculous, more robust. Key idea: find leverage scores.
55
ALGORITHMIC PICTURE (A' → C' → τ' → A'): a sketch, covariance matrix, and leverage scores with error k give all three with high accuracy in O(m + kd^{ω+α}) time.
56
OBSERVATIONS: error does not accumulate (≈_k → ≈_k → ≈_{O(1)}, with an O(k) size increase), so we can loop around many times. Unused parameter: the size of A.
57
OUR APPROACH: create a shorter matrix A_s such that the total leverage score of each block of A is close.
58
LEVERAGE SCORE OF A BLOCK: the total leverage score of a block is the squared Frobenius norm of A_{1:k} C^{-1/2}, which is preserved under random projection: with G an O(1)-by-k Gaussian (so GA_{1:k} has O(1) rows), ||τ_{1..k}|| = ||A_{1:k}C^{-1/2}||_F^2 ≈ ||GA_{1:k}C^{-1/2}||_F^2.
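A small numpy demonstration of this preservation (illustrative; the block size of 50 and the 4-row projection are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((1000, 10))
L = np.linalg.cholesky(A.T @ A)
C_half_inv = np.linalg.inv(L).T            # acts as C^{-1/2} (up to rotation)

block = A[:50]                             # a block of 50 rows of A
G = rng.standard_normal((4, 50)) / 2.0     # O(1)-by-k Gaussian, scaled 1/sqrt(4)
exact = np.linalg.norm(block @ C_half_inv, "fro") ** 2   # block's total leverage
approx = np.linalg.norm(G @ block @ C_half_inv, "fro") ** 2
print(exact, approx)                       # close in expectation
```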
59
SIZE REDUCTION: recursing on A_s gives leverage scores that sum to ≤ d and can be used to row sample A.
60
ALGORITHM: decrease the size by d^α and recurse; bring back the leverage scores; reduce the error. (≈_k → ≈_k → ≈_{O(1)}, O(k) size increase.)
61
PROBLEM: leverage scores in A_s are measured using C_s = A_s^T A_s. We already have a bound on their total; it suffices to show ||xC^{-1/2}||_2 ≤ k||xC_s^{-1/2}||_2.
62
PROOF SKETCH: show ||C_s^{1/2}x||_2 ≤ k||C^{1/2}x||_2 and invert both sides (some issues when A_s has smaller rank than A). This gives the needed ||xC^{-1/2}||_2 ≤ k||xC_s^{-1/2}||_2.
63
||C_s^{1/2}x||_2 ≤ k||C^{1/2}x||_2: with b ranging over the blocks of A_s, ||C_s^{1/2}x||_2^2 = ||A_s x||_2^2 = Σ_b Σ_i (G_{i,b} A_b x)^2 ≤ Σ_b Σ_i ||G_{i,b}||_2^2 ||A_b x||_2^2 ≤ (max_{b,i} ||G_{i,b}||_2^2) ||Ax||_2^2 ≤ O(k log n)||Ax||_2^2.
64
P = 1, OR ARBITRARY P: the same approach can still work with p-norm leverage scores. Need: a well-conditioned basis U for the column space, and ||Ax||_p ≈ ||A'x||_p for all x.
65
QUALITY OF BASIS (P = 1): quality of U is the maximum distortion in the dual norm: β = max_{x≠0} ||Ux||_∞ / ||x||_∞. Analog of leverage scores: τ_i = β||U_{i,:}||_1. Total number of rows: β||U||_1.
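Since β = max_{x≠0} ||Ux||_∞/||x||_∞ is attained by the row of U with the largest 1-norm, these scores are cheap once U is known. A hedged numpy sketch (l1_scores is a hypothetical helper):

```python
import numpy as np

def l1_scores(U):
    """p = 1 sampling scores: beta = max_i ||U_i||_1 (the induced
    inf-norm of U); tau_i = beta * ||U_i||_1, tau.sum() = beta * ||U||_1."""
    row_l1 = np.abs(U).sum(axis=1)
    beta = row_l1.max()
    return beta * row_l1
```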
66
BASIS CONSTRUCTION: basis via a linear transform, U = AC_1. Compute ||U_i||_1 using p-stable distributions (Indyk `06) instead of JL. The loop becomes C_1, U → τ' → A' (≈_k → ≈_k → ≈_{O(1)}, O(k) size increase).
67
ITERATIVE ALGORITHM FOR P = 1: take C_1 = C^{-1/2}, the ℓ_2 basis. Quality of U = AC_1: β||U_{i,:}||_1 = n^{1/2}d. Too coarse for a single step, but good enough to iterate: n approaches poly(d) quickly. Need to run the ℓ_2 algorithm for C.
68
SUMMARY:

| | p = 2 cost for d log d rows | p = 1 cost | p = 1 size |
| --- | --- | --- | --- |
| Sohler & Woodruff `11 | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | | |
| Clarkson et al. `12 | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m + d^3 log^2 d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m + d^3 log d | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m + d^ω | same as above | same as above |
| This paper | m + d^{ω+α} | m + d^{ω+α} | d^3.66 |

Robust steps give algorithms. ℓ_2: more complicated than sketching; smaller overhead for p-norms.
69
FUTURE WORK
- What are leverage scores???
- Iterative low-rank approximation?
- Better p-norm leverage scores?
- More streamlined view of the projections in our algorithm?
- Empirical evaluation?
70
THANK YOU! Questions?