1
Iterative Row Sampling. Richard Peng (CMU / MIT). Joint work with Mu Li (CMU) and Gary Miller (CMU).
2
OUTLINE
- Matrix sketches: existence
- Samples → better samples
- Iterative algorithms
3
DATA: n-by-d matrix A with m entries. Columns: data points; rows: attributes. Goal: classification/clustering, identifying patterns, interpreting new data.
4
LINEAR MODEL: can add/scale data points. x: coefficients; combination: Ax = x_1 A_{:,1} + x_2 A_{:,2} + x_3 A_{:,3}.
5
PROBLEM: interpret a new data point b as a combination of known ones: Ax ≈ b?
6
REGRESSION: express b as a combination of current examples. Regression: min_x ||Ax − b||_p. p = 2: least squares; p = 1: compressive sensing. ||x||_2: Euclidean norm of x; ||x||_1: sum of absolute values.
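For the p = 2 case, a minimal numpy sketch (illustrative only; the toy data here is random, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))   # known examples: columns combined by Ax
b = rng.standard_normal(n)        # new data point to express as Ax

# p = 2 (least squares): min_x ||Ax - b||_2 via numpy's lstsq
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print("residual norm:", np.linalg.norm(A @ x - b))
```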
7
VARIANTS OF COMPRESSIVE SENSING
- min_x ||Ax − b||_1 + ||x||_1
- min_x ||Ax − b||_2 + ||x||_1
- min_x ||x||_1 s.t. Ax = b
- min_x ||Ax||_1 s.t. Bx = y
- min_x ||Ax − b||_1 + ||Bx − y||_1
All similar to min_x ||Ax − b||_1.
8
SIMPLIFIED: min_x ||Ax − b||_p = min_x ||[A, b][x; −1]||_p, so regression is equivalent to min ||Ax||_p with one entry of x fixed.
9
'BIG' DATA POINTS: each data point has many attributes, so #rows (n) >> #columns (d). Examples: genetic data, time series (videos). The reverse (d >> n) is also common: images + SIFT.
10
FASTER? Replace A with a smaller, equivalent A': a matrix sketch.
11
ROW SAMPLING: pick some rows of A to be A'. How to pick? Randomly.
12
SHORTER EQUIVALENT: find a shorter A' that preserves the answer: ||Ax||_p ≈_{1+ε} ||A'x||_p for all x. Run the algorithm on A'; an answer good for A' is good for A. Simplified error notation: a ≈_k b if there exist k_1, k_2 such that k_2/k_1 ≤ k and k_1 a ≤ b ≤ k_2 a.
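To make the ≈_k guarantee concrete, a hedged numpy check (check_sketch is a hypothetical helper, not from the paper); the spread of the ratio over random directions lower-bounds the true distortion:

```python
import numpy as np

def check_sketch(A, A_prime, trials=1000, seed=0):
    """Probe ||Ax||_2 / ||A'x||_2 over random x; the observed spread
    lower-bounds the k for which ||Ax||_2 =~_k ||A'x||_2 holds."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        x = rng.standard_normal(A.shape[1])
        ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(A_prime @ x))
    return min(ratios), max(ratios)
```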
13
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
14
SKETCHES EXIST. Linear sketches: A' = SA. [Drineas et al. `12]: row sampling, one non-zero in each row of S. [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column. Guarantee: ||Ax||_p ≈ ||A'x||_p for all x.
15
SKETCHES EXIST (sketch sizes, number of rows):

| | p = 2 | p = 1 |
| --- | --- | --- |
| Dasgupta et al. `09 | | d^2.5 |
| Magdon-Ismail `10 | d log^2 d | |
| Sohler & Woodruff `11 | | d^3.5 |
| Drineas et al. `12 | d log d | |
| Clarkson et al. `12 | | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | d^2 log d | d^8 |
| Mahoney & Meng `12 | d^2 | d^3.5 |
| Nelson & Nguyen `12 | d^{1+α} | |
| This paper | d log d | d^3.66 |

Hidden: runtime costs, ε^{-2} dependency.
16
WHY IS ≈ d POSSIBLE? ||Ax||_2^2 = x^T A^T A x, and A^T A is a d-by-d matrix. Any factorization (e.g., QR) of A^T A suffices as A': ||Ax||_2 = ||A'x||_2 for all x.
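For instance, a minimal numpy sketch (not from the slides): the R factor of a QR factorization satisfies R^T R = A^T A, so R is an exact d-row equivalent of A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))

# R^T R = A^T A, so ||Ax||_2 = ||Rx||_2 exactly: a d-row "sketch" of A.
R = np.linalg.qr(A, mode="r")
x = rng.standard_normal(10)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # equal up to rounding
```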
17
A^T A: THE COVARIANCE MATRIX. Dot products of all pairs of columns (data points). Covariance: cov(j_1, j_2) = Σ_i A_{i,j_1} A_{i,j_2}.
18
USE OF COVARIANCE MATRIX C = A^T A. Clustering: ℓ_2 distances of all pairs are given by C. Kernel methods: all-pair dot products suffice for many models.
19
OTHER USE OF COVARIANCE: covariance of attributes used to tune parameters. Images + SIFT: many data points, few attributes. http://www.image-net.org/: 14,197,122 images, 1000 SIFT features.
20
HOW EXPENSIVE IS THIS? d^2 dot products of length-n vectors. Total: O(nd^2); faster: O(nd^{ω−1}). Still expensive: nd^2 > nd > m.
21
EQUIVALENT VIEW OF SKETCHES: approximate covariance matrix C' = (A')^T A'. ||Ax||_2 ≈ ||A'x||_2 is the same as C ≈ C'.
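For p = 2 this equivalence can be checked exactly: the best k with C ≈_k C' is a ratio of extreme generalized eigenvalues of (C, C'). A hedged numpy sketch (spectral_distortion is a hypothetical helper; assumes A' has full column rank):

```python
import numpy as np

def spectral_distortion(A, A_prime):
    """Smallest k with C =~_k C', where C = A^T A and C' = A'^T A':
    the ratio of extreme eigenvalues of L^{-1} C L^{-T}, with C' = L L^T."""
    C = A.T @ A
    L = np.linalg.cholesky(A_prime.T @ A_prime)
    L_inv = np.linalg.inv(L)
    w = np.linalg.eigvalsh(L_inv @ C @ L_inv.T)
    return w.max() / w.min()
```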
22
APPLICATION OF SKETCHES: with A' of n' rows, the d^2 dot products are over length-n' vectors. Total cost: O(n'd^{ω−1}).
23
SKETCHES IN INPUT SPARSITY TIME. Need: cost of computing C' < cost of computing C = A^T A. Two goals: n' small, and A' found efficiently.
24
COST AND QUALITY OF A':

| | p = 2 cost | p = 2 size | p = 1 cost | p = 1 size |
| --- | --- | --- | --- | --- |
| Dasgupta et al. `09 | | | nd^5 | d^2.5 |
| Magdon-Ismail `10 | nd^2/log d | d log^2 d | | |
| Sohler & Woodruff `11 | | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | d log d | | |
| Clarkson et al. `12 | | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m | d^2 log d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m | d^2 | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m | d^{1+α} | same as above | same as above |
| This paper | m + d^{ω+α} | d log d | m + d^{ω+α} | d^3.66 |
25
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
26
PREVIOUS APPROACHES: go to poly(d) rows directly, using projection to obtain key information, or the sketch itself. (A: m rows → A': poly(d) rows; "a miracle happens.")
27
OUR MAIN APPROACH: utilize the robustness of sketches, covariance matrices, and sampling; iteratively reduce errors and sizes (A → A'' → A').
28
BETTER ALGORITHM FOR P = 2:

| | p = 2 cost | p = 2 size | p = 1 cost | p = 1 size |
| --- | --- | --- | --- | --- |
| Dasgupta et al. `09 | | | nd^5 | d^2.5 |
| Magdon-Ismail `10 | nd^2/log d | d log^2 d | | |
| Sohler & Woodruff `11 | | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | d log d | | |
| Clarkson et al. `12 | | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m | d^2 log d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m | d^2 | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m | d^{1+α} | same as above | same as above |
| This paper | m + d^{ω+α} | d log d | m + d^{ω+α} | d^3.66 |
29
COMPOSING SKETCHES: A (n rows) → A'' (n' = d^{1+α} rows, in O(m)) → A' (d log d rows, in O(n'd log d + d^ω)). Total cost: O(m + n'd log d + d^ω) = O(m + d^ω).
30
ACCUMULATION OF ERRORS: ||Ax||_2 ≈_k ||A''x||_2 (n rows → n' = d^{1+α} rows) and ||A''x||_2 ≈_{k'} ||A'x||_2 (→ d log d rows) give ||Ax||_2 ≈_{kk'} ||A'x||_2.
31
ACCUMULATION OF ERRORS: ||Ax||_2 ≈_{kk'} ||A'x||_2, so the final error is the product of both errors. Dependency of error in cost: usually ε^{-2} or more for 1±ε error. [Avron & Toledo `11]: only the final step needs to be accurate. Idea: compute sketches indirectly.
32
ROW SAMPLING: pick some rows of A to be A'. How to pick? Randomly.
33
ARE ALL ROWS EQUAL? No: if A has a column with a single non-zero entry (one non-zero row), then ||A[1; 0; …; 0]||_p ≠ 0 depends entirely on that row, so uniform sampling can miss it.
34
ROW SAMPLING: τ': weights on the rows, normalized into a distribution. Pick a number of rows independently from this distribution and rescale them to form A'.
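A minimal numpy sketch of this sampling step (row_sample is a hypothetical helper; assumes the weights τ' are given as an array tau):

```python
import numpy as np

def row_sample(A, tau, num_rows, seed=0):
    """Draw num_rows rows i.i.d. with probability p_i = tau_i / sum(tau),
    rescaling row i by 1/sqrt(num_rows * p_i) so E[A'^T A'] = A^T A."""
    rng = np.random.default_rng(seed)
    p = tau / tau.sum()
    idx = rng.choice(A.shape[0], size=num_rows, p=p)
    return A[idx] / np.sqrt(num_rows * p[idx])[:, None]
```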
35
MATRIX CHERNOFF BOUNDS give a sufficient property of τ'. Let τ be the statistical leverage scores. If τ' ≥ τ entrywise, then ||τ'||_1 log d (scaled) rows suffice for A' ≈ A.
36
STATISTICAL LEVERAGE SCORES: studied in statistics since the 70s as a measure of the importance of rows. Leverage score of row i, A_i: τ_i = A_i (A^T A)^{-1} A_i^T. Key fact: ||τ||_1 = rank ≤ d, so ||τ'||_1 log d = d log d rows.
37
COMPUTING LEVERAGE SCORES: τ_i = A_i (A^T A)^{-1} A_i^T = A_i C^{-1} A_i^T, where C = A^T A is the covariance matrix. Given C^{-1}, each τ_i can be computed in O(d^2) time. Total cost: O(nd^2 + d^ω).
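A direct numpy sketch of this computation (leverage_scores is a hypothetical helper; assumes A has full column rank):

```python
import numpy as np

def leverage_scores(A):
    """tau_i = A_i C^{-1} A_i^T with C = A^T A (full column rank assumed)."""
    C_inv = np.linalg.inv(A.T @ A)
    # Row-wise quadratic forms: sum_jk A_ij * C_inv_jk * A_ik.
    return np.einsum("ij,jk,ik->i", A, C_inv, A)

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 8))
print(leverage_scores(A).sum())  # = rank(A) = 8, matching ||tau||_1 = rank
```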
38
COMPUTING LEVERAGE SCORES: τ_i = A_i C^{-1} A_i^T = ||A_i C^{-1/2}||_2^2, the squared 2-norm of the vector A_i C^{-1/2}. Multiplying by C^{-1/2} puts the rows in isotropic position: it decorrelates the columns.
39
ASIDE: WHAT IS LEVERAGE? Geometric view: rows define 'energy' directions; normalize so the total energy is uniform. τ_i: norm of row i after normalizing (A_i → A_i C^{-1/2}).
40
ASIDE: WHAT IS LEVERAGE? How to interpret statistical leverage scores? In statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]): influence on the data set, likelihood of being an outlier, uniqueness of a row.
41
ASIDE: WHAT IS LEVERAGE? A high leverage score: key attribute, or outlier (measurement error)?
42
ASIDE: WHAT IS LEVERAGE? My current view (motivated by graph sparsification): they are sampling probabilities; use them to find sketches.
43
COMPUTING LEVERAGE SCORES: τ_i = ||A_i C^{-1/2}||_2^2. We only need τ' ≥ τ, so we can use approximations after scaling them up; the error only leads to a larger ||τ'||_1.
44
DIMENSIONALITY REDUCTION: Johnson-Lindenstrauss transform: ||x||_2^2 ≈_{jl} ||xG||_2^2, with G a d-by-O(1/α) Gaussian and error jl = d^α.
45
ESTIMATING LEVERAGE SCORES: τ_i = ||A_i C^{-1/2}||_2^2 ≈_{jl} ||A_i C^{-1/2}G||_2^2, with G a d-by-O(1/α) Gaussian, so C^{-1/2}G is d-by-O(1/α). Cost: O(α·nnz(A_i)) per row; total O(α·m + α·d^2 log d).
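A hedged numpy sketch of this estimator (approx_leverage_scores is a hypothetical helper; it uses a Cholesky factor in place of C^{1/2} and a fixed sketch width k, both assumptions):

```python
import numpy as np

def approx_leverage_scores(A, k=20, seed=0):
    """Estimate tau_i = ||A_i C^{-1/2}||_2^2 by ||A_i (C^{-1/2} G)||_2^2
    with G a d-by-k scaled Gaussian; the d-by-k sketch is formed once."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    L = np.linalg.cholesky(A.T @ A)           # C = L L^T
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    sketch = np.linalg.inv(L).T @ G           # acts as C^{-1/2} G (up to rotation)
    return np.sum((A @ sketch) ** 2, axis=1)
```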
46
ESTIMATING LEVERAGE SCORES: C ≈_k C' gives ||C^{-1/2}x||_2 ≈_k ||C'^{-1/2}x||_2, i.e., C' can be used as a preconditioner for C: τ_i = ||A_i C^{-1/2}||_2^2 ≈ ||A_i C'^{-1/2}||_2^2. This can also be combined with JL.
47
ESTIMATING LEVERAGE SCORES: τ'_i = ||A_i C'^{-1/2}G||_2^2 ≈_{jl} ||A_i C'^{-1/2}||_2^2 ≈_{k} ||A_i C^{-1/2}||_2^2 = τ_i, so (jl·k)·τ' ≥ τ. Total number of rows: ||jl·k·τ'||_1 ≤ jl·k·||τ'||_1 ≤ k·d^{1+α}.
48
ESTIMATING LEVERAGE SCORES: the quality of A' does not depend on the quality of τ'. C ≈_k C' gives A' ≈_2 A with O(kd^{1+α}) rows in O(m + d^ω) time, since (jl·k)·τ' ≥ τ and ||jl·k·τ'||_1 ≤ jl·k·d^{1+α}. Some fixable issues when n >>> d.
49
SIZE REDUCTION: A'' ≈_{O(1)} A → C'' ≈_{O(1)} C → τ' ≈_{O(1)} τ → A' ≈_{O(1)} A with O(d^{1+α} log d) rows.
50
HIGH ERROR SETTING: A'' ≈_k A → C'' ≈_k C → τ' ≈_k τ → A' ≈_{O(1)} A with O(kd^{1+α} log d) rows.
51
ACCURACY BOOSTING: can reduce any error k in O(m + kd^{ω+α}) time; all intermediate steps can have large (constant) error.
52
OUTLINE
- Matrix sketches: how? existence
- Samples → better samples
- Iterative algorithms
53
ONE STEP SKETCHING: obtain a sketch of size poly(d), then error-correct to O(d log d) rows in poly(d) time. (A: m rows → A': poly(d) rows → A'': d log d rows; "a miracle happens.")
54
WHAT WE WILL SHOW: a number of iterative steps can give a similar result; more work, less miraculous, more robust. Key idea: find leverage scores.
55
ALGORITHMIC PICTURE (A' → C' → τ' → A'): a sketch, covariance matrix, and leverage scores with error k give all three with high accuracy in O(m + kd^{ω+α}) time.
56
OBSERVATIONS: error does not accumulate (≈_k → ≈_k → ≈_{O(1)}, with an O(k) size increase), so we can loop around many times. Unused parameter: the size of A.
57
OUR APPROACH: create a shorter matrix A_s such that the total leverage score of each block of A is close.
58
LEVERAGE SCORE OF A BLOCK: the total leverage score of a block is the squared Frobenius norm of A_{1:k} C^{-1/2}, which is preserved under random projection: with G an O(1)-by-k Gaussian (so GA_{1:k} has O(1) rows), ||τ_{1..k}|| = ||A_{1:k}C^{-1/2}||_F^2 ≈ ||GA_{1:k}C^{-1/2}||_F^2.
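A small numpy demonstration of this preservation (illustrative; the block size of 50 and the 4-row projection are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((1000, 10))
L = np.linalg.cholesky(A.T @ A)
C_half_inv = np.linalg.inv(L).T            # acts as C^{-1/2} (up to rotation)

block = A[:50]                             # a block of 50 rows of A
G = rng.standard_normal((4, 50)) / 2.0     # O(1)-by-k Gaussian, scaled 1/sqrt(4)
exact = np.linalg.norm(block @ C_half_inv, "fro") ** 2   # block's total leverage
approx = np.linalg.norm(G @ block @ C_half_inv, "fro") ** 2
print(exact, approx)                       # close in expectation
```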
59
SIZE REDUCTION: recursing on A_s gives leverage scores that sum to ≤ d and can be used to row sample A.
60
ALGORITHM: decrease the size by d^α and recurse; bring back the leverage scores; reduce the error. (≈_k → ≈_k → ≈_{O(1)}, O(k) size increase.)
61
PROBLEM: leverage scores in A_s are measured using C_s = A_s^T A_s. We already have a bound on their total; it suffices to show ||xC^{-1/2}||_2 ≤ k||xC_s^{-1/2}||_2.
62
PROOF SKETCH: show ||C_s^{1/2}x||_2 ≤ k||C^{1/2}x||_2 and invert both sides (some issues when A_s has smaller rank than A). This gives the needed ||xC^{-1/2}||_2 ≤ k||xC_s^{-1/2}||_2.
63
||C_s^{1/2}x||_2 ≤ k||C^{1/2}x||_2: with b ranging over the blocks of A_s, ||C_s^{1/2}x||_2^2 = ||A_s x||_2^2 = Σ_b Σ_i (G_{i,b} A_b x)^2 ≤ Σ_b Σ_i ||G_{i,b}||_2^2 ||A_b x||_2^2 ≤ (max_{b,i} ||G_{i,b}||_2^2) ||Ax||_2^2 ≤ O(k log n)||Ax||_2^2.
64
P = 1, OR ARBITRARY P: the same approach can still work with p-norm leverage scores. Need: a well-conditioned basis U for the column space, and ||Ax||_p ≈ ||A'x||_p for all x.
65
QUALITY OF BASIS (P = 1): quality of U is the maximum distortion in the dual norm: β = max_{x≠0} ||Ux||_∞ / ||x||_∞. Analog of leverage scores: τ_i = β||U_{i,:}||_1. Total number of rows: β||U||_1.
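Since β = max_{x≠0} ||Ux||_∞/||x||_∞ is attained by the row of U with the largest 1-norm, these scores are cheap once U is known. A hedged numpy sketch (l1_scores is a hypothetical helper):

```python
import numpy as np

def l1_scores(U):
    """p = 1 sampling scores: beta = max_i ||U_i||_1 (the induced
    inf-norm of U); tau_i = beta * ||U_i||_1, tau.sum() = beta * ||U||_1."""
    row_l1 = np.abs(U).sum(axis=1)
    beta = row_l1.max()
    return beta * row_l1
```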
66
BASIS CONSTRUCTION: basis via a linear transform, U = AC_1. Compute ||U_i||_1 using p-stable distributions (Indyk `06) instead of JL. The loop becomes C_1, U → τ' → A' (≈_k → ≈_k → ≈_{O(1)}, O(k) size increase).
67
ITERATIVE ALGORITHM FOR P = 1: take C_1 = C^{-1/2}, the ℓ_2 basis. Quality of U = AC_1: β||U_{i,:}||_1 = n^{1/2}d. Too coarse for a single step, but good enough to iterate: n approaches poly(d) quickly. Need to run the ℓ_2 algorithm for C.
68
SUMMARY:

| | p = 2 cost for d log d rows | p = 1 cost | p = 1 size |
| --- | --- | --- | --- |
| Sohler & Woodruff `11 | | nd^{ω−1+α} | d^3.5 |
| Drineas et al. `12 | nd log d + d^ω | | |
| Clarkson et al. `12 | | nd log d | d^4.5 log^1.5 d |
| Clarkson & Woodruff `12 | m + d^3 log^2 d | m + d^7 | d^8 |
| Mahoney & Meng `12 | m + d^3 log d | m log n + d^8 | d^3.5 |
| Nelson & Nguyen `12 | m + d^ω | same as above | same as above |
| This paper | m + d^{ω+α} | m + d^{ω+α} | d^3.66 |

Robust steps give algorithms. ℓ_2: more complicated than sketching; smaller overhead for p-norms.
69
FUTURE WORK
- What are leverage scores???
- Iterative low-rank approximation?
- Better p-norm leverage scores?
- More streamlined view of the projections in our algorithm?
- Empirical evaluation?
70
THANK YOU! Questions?