Sampling: an Algorithmic Perspective Richard Peng M.I.T.


OUTLINE
Structure preserving sampling
Sampling as a recursive 'driver'
Sampling the inaccessible
What can sampling preserve?

RANDOM SAMPLING
Collection of many objects
Pick a small subset of them
Goal: estimate quantities; small approximates; use in algorithms

SAMPLING CAN APPROXIMATE
Point sets, matrices, graphs, gradients

PRESERVING GRAPH STRUCTURES
Undirected graph: n vertices, m < n^2 edges
Are n^2 edges (dense) sometimes necessary?
For some information, e.g. connectivity, no: it is encoded by a spanning forest with < n edges
Deterministic, O(m) time algorithm
(slide legend: questions)
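
A minimal sketch of this connectivity-preserving reduction (the graph here is made up for illustration; union-find as written is near-linear rather than strictly O(m)):

def spanning_forest(n, edges):
    """Keep only the edges that join two previously disconnected components."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge connects two components: keep it
            parent[ru] = rv
            forest.append((u, v))
    return forest                           # < n edges, same connectivity

# toy example: a triangle plus a separate edge
print(spanning_forest(5, [(0, 1), (1, 2), (2, 0), (3, 4)]))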

MORE INTRICATE STRUCTURES
k-connectivity: # of disjoint paths between s and t (Menger's theorem / maxflow-mincut)
Cut: # of edges leaving a subset of vertices
[Benczur-Karger `96]: for ANY G, can sample to get H with O(n log n) edges s.t. G ≈ H on all cuts
Stronger: preserves the weights of all 2^n cuts in the graph
(≈: multiplicative approximation; slide legend: previous works)
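
All such cut sparsifiers share the same keep-with-probability-p, rescale-by-1/p step; the sketch below shows only that generic step, with made-up uniform probabilities rather than the strength-based probabilities of [Benczur-Karger `96]:

import random

def sample_and_reweight(edges, prob, rng=random.Random(0)):
    """Keep each weighted edge (u, v, w) independently with probability prob[e];
    scale kept weights by 1/prob[e] so every cut is preserved in expectation."""
    H = []
    for e, (u, v, w) in enumerate(edges):
        p = min(1.0, prob[e])
        if rng.random() < p:
            H.append((u, v, w / p))
    return H

# toy usage with uniform probability 1/2 on every edge
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (0, 3, 1.0)]
print(sample_and_reweight(edges, [0.5] * len(edges)))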

MORE GENERAL: ROW SAMPLING
L2 row sampling: given an m × n matrix A with m >> n, sample a few rows to form A' (≈ n rows) s.t. ║Ax║_2 ≈ ║A'x║_2 ∀x
║Ax║_p: finite dimensional Banach space
Sampling: embedding Banach spaces, e.g. [BLM `89], [Talagrand `90]

HOW TO SAMPLE?
Widely used: uniform sampling
Works well when data is uniform, e.g. the complete graph
Problem: on a long path, removing any edge changes connectivity (can also have both in one graph)
More systematic view of sampling?

SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: suffices to sample each edge with probability at least O(log n) × weight × effective resistance
Effective resistance: commute time / (2m); equals the statistical leverage score in unweighted graphs
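
A small numpy sketch of the definition (not the nearly-linear-time estimation of [Spielman-Srivastava `08]): effective resistances read off the Laplacian pseudo-inverse for a toy unweighted graph:

import numpy as np

def effective_resistances(n, edges):
    """R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v) for each edge (u, v)."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    Lpinv = np.linalg.pinv(L)
    res = []
    for u, v in edges:
        chi = np.zeros(n); chi[u], chi[v] = 1.0, -1.0
        res.append(chi @ Lpinv @ chi)
    return res

# unweighted graph: sampling probabilities ~ O(log n) * effective resistance
print(effective_resistances(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))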

L2 MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i = ║b_i║²_{L^{-1}} (L = B^T B)
[Foster `49]: Σ_i τ_i = rank ≤ n ⇒ O(n log n) rows
[Rudelson-Vershynin `07], [Tropp `12]: sampling with p_i ≥ τ_i · O(log n) gives B' s.t. ║Bx║_2 ≈ ║B'x║_2 ∀x w.h.p.
Near optimal for: L2 row samples of B, graph sparsifiers
In practice O(log n) → 5 usually suffices; can also improve via derandomization
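
A hedged numpy sketch of this recipe on a random tall matrix (the constant 5 in front of log n is an arbitrary stand-in for the O(log n) factor):

import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 20
B = rng.standard_normal((m, n))

# leverage scores tau_i = b_i^T (B^T B)^{-1} b_i; they sum to rank <= n
tau = np.einsum('ij,jk,ik->i', B, np.linalg.inv(B.T @ B), B)

# keep row i with probability p_i >= tau_i * O(log n), rescale by 1/sqrt(p_i)
p = np.minimum(1.0, 5 * np.log(n) * tau)
keep = rng.random(m) < p
Bp = B[keep] / np.sqrt(p[keep])[:, None]

# spot check ||Bx||_2 vs ||B'x||_2 in a random direction
x = rng.standard_normal(n)
print(len(Bp), np.linalg.norm(B @ x), np.linalg.norm(Bp @ x))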

THE `RIGHT' PROBABILITIES
τ: L2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i
Examples: a single non-zero row, or a column with only one entry, has τ = 1 and must be kept, while uniform sampling would pick it with probability only ~n/m
Path + clique: path edges have τ = 1, clique edges have τ ≈ 1/n
Any good upper bounds to τ_i lead to size reductions

OUTLINE
Structure preserving sampling
Sampling as a recursive 'driver'
Sampling the inaccessible
What can sampling preserve?

ALGORITHMIC TEMPLATES
W-cycle: T(m) = 2T(m/2) + O(m); instances: sorting, FFT, Voronoi / Delaunay
V-cycle: T(m) = T(m/2) + O(m); instances: selection, parallel independent set, routing
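
A toy sanity check of the two recurrences, counting hypothetical unit costs: the W-cycle total grows like m log m, while the V-cycle total stays O(m):

def w_cycle(m):
    # T(m) = 2 T(m/2) + m
    return m if m <= 1 else 2 * w_cycle(m // 2) + m

def v_cycle(m):
    # T(m) = T(m/2) + m
    return m if m <= 1 else v_cycle(m // 2) + m

for m in [2**10, 2**15, 2**20]:
    print(m, w_cycle(m), v_cycle(m))   # ~ m log m   vs   ~ 2m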

EFFICIENT GRAPH ALGORITHMS
Partition via separators
Difficulty: many non-separable graphs exist, and it is easy to compose hard instances

SIZE REDUCTION
Ultra-sparsifier: for any k, can find H ≈_k G that is a tree + O(m log^c n / k) edges
e.g. [Koutis-Miller-P `10]: obtain crude estimates on τ_i via a tree
H is equivalent to a graph of size O(m log^c n / k)
Picking k > log^c n gives size reductions
(slide legend: my results)

INSTANCE: Lx = b
Input: graph Laplacian L, vector b
Output: x ≈_ε L^+ b (L^+: pseudo-inverse; approximate solution; omitting log(1/ε) factors)
Runtimes:
[KMP `10, `11]: O(m log n) work, O(m^{1/3}) depth
[CKPPR `14, CMPPX `14]: O(m log^{1/2} n) work, O(m^{1/3}) depth
Recursive Chebyshev iteration: T(m) = k^{1/2} (T(m log^c n / k) + O(m))
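
A minimal sketch of a preconditioned Laplacian solve, with scipy's conjugate gradient and a plain Jacobi (diagonal) preconditioner standing in for the recursive tree-based preconditioners of [KMP `10, `11]; grounding one vertex is a standard way to remove the Laplacian's null space:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Laplacian of a cycle on n vertices
n = 200
idx = np.arange(n)
A = sp.coo_matrix((np.ones(n), (idx, (idx + 1) % n)), shape=(n, n))
A = A + A.T
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

b = np.random.default_rng(1).standard_normal(n)
b -= b.mean()                              # keep b in the range of L

# ground vertex 0 to make the system nonsingular, then run preconditioned CG
Lg = sp.csr_matrix(L)[1:, 1:]
bg = b[1:]
M = sp.diags(1.0 / Lg.diagonal())          # Jacobi stand-in for a tree preconditioner
x, info = spla.cg(Lg, bg, M=M)
print(info, np.linalg.norm(Lg @ x - bg))   # info == 0 means converged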

INSTANCE: INPUT-SPARSITY TIME NUMERICAL ALGORITHMS
[Li-Miller-P `13]: create a smaller approximation, recurse on it, bring the solution back
Similar: Nystrom method (sample, then post-process)

INSTANCE: APPROX MAXFLOW
[Sherman `13], [KLOS `14]: structure approximators → fast maxflow routines
[Racke-Shah-Taubig `14]: good approximator by solving maxflows
[P `14]: build the approximator on a smaller graph
Absorb the additional (small) error via more calls to the approximator
Recurse on instances with smaller total size; total cost: O(m log^c n)

OUTLINE
Structure preserving sampling
Sampling as a recursive 'driver'
Sampling the inaccessible
What can sampling preserve?

DENSE OBJECTS
Matrix inverses, Schur complements, k-step random walks, applications of separators
Cost-prohibitive to store
Directly access sparse approximations?

TWO STEP RANDOM WALKS
A: one step of the random walk
A^2: 2-step random walk
Still a graph, can sparsify!

WHAT THIS ENABLED
Combining known tools: efficiently sparsify I – A^2 without computing A^2
[P-Spielman `14]: use this to approximate (I – A)^{-1} = (I + A)(I + A^2)(I + A^4)…
Similar to multi-level methods (skipping: control / propagation of error)
[Cheng-Cheng-Liu-P-Teng `15]: sparsified Newton's method for matrix roots and Gaussian sampling
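
A quick numpy check of the identity on a small dense matrix (the actual algorithm never forms these dense powers; it sparsifies each factor):

import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.random((n, n)); M = (M + M.T) / 2
A = 0.9 * M / np.linalg.norm(M, 2)          # symmetric, spectral norm 0.9 < 1

prod, P = np.eye(n), A.copy()
for _ in range(8):                           # (I + A)(I + A^2)(I + A^4)...
    prod = prod @ (np.eye(n) + P)
    P = P @ P                                # square the current power

print(np.max(np.abs(prod - np.linalg.inv(np.eye(n) - A))))   # tiny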

MATRIX SQUARING
                     Connectivity              More general
Iteration:           A_{i+1} ≈ A_i^2           I - A_{i+1} ≈ I - A_i^2   (until ║A_d║ is small)
Size reduction:      low degree                sparse graph
Method:              derandomized              randomized
Solution transfer:   connectivity              solution vectors
NC algorithm for shortest path
Logspace connectivity: [Reingold `02]; deterministic squaring: [Rozenman-Vadhan `05]

LONGER RANDOM WALKS
A: one step of random walk
A^3: 3 steps of random walk
(part of) an edge uv in A^3 corresponds to a length-3 path in A: u-y-z-v

PSEUDOCODE
Repeat O(c m log n ε^{-2}) times:
1. Uniformly randomly pick 1 ≤ k ≤ c and an edge e = uv
2. Perform a (k-1)-step random walk from u
3. Perform a (c-k)-step random walk from v
4. Add a scaled copy of the edge between the two endpoints to the sparsifier
Resembles: local clustering, approximate triangle counting (c = 3)
[Cheng-Cheng-Liu-P-Teng `15]: combine this with repeated squaring to approximate any random walk polynomial in nearly-linear time.
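
A toy Python rendering of the sampling loop above, for illustration only: it tracks which vertex pairs get connected by length-c walks around a random edge, and omits the reweighting needed for the actual spectral guarantee:

import random
from collections import defaultdict

def sample_walk_edges(adj, c, num_samples, rng=random.Random(0)):
    """Sample endpoint pairs of length-c walks as in the pseudocode: pick a random
    edge uv and 1 <= k <= c, walk k-1 steps from u and c-k steps from v."""
    edges = [(u, v) for u in adj for v in adj[u]]
    sparsifier = defaultdict(float)
    for _ in range(num_samples):
        u, v = rng.choice(edges)
        k = rng.randint(1, c)
        for _ in range(k - 1):          # (k-1)-step walk from u
            u = rng.choice(adj[u])
        for _ in range(c - k):          # (c-k)-step walk from v
            v = rng.choice(adj[v])
        sparsifier[(u, v)] += 1.0       # endpoints of a length-c path in A
    return dict(sparsifier)

# toy graph: a 4-cycle; vertex pairs reachable by length-3 walks
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(sample_walk_edges(adj, c=3, num_samples=20))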

GAUSSIAN ELIMINATION
Partial state of Gaussian elimination: a linear system on a subset of the variables
Graph theoretic interpretation: equivalent circuit on the boundary, Y-Δ transform
[Lee-P-Spielman, in progress]: approximate such circuits in O(m log^c n) time
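
A small numpy illustration of the 'equivalent circuit' view (the result above computes a sparse approximation of this object; the dense Schur complement below is only to show it is again a graph Laplacian):

import numpy as np

def laplacian(n, edges):
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

# star with center 0: eliminating vertex 0 is the Y-Delta transform,
# which leaves a weighted clique on the leaves
n = 4
L = laplacian(n, [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)])
F, C = [0], [1, 2, 3]                     # eliminate F, keep boundary C
S = L[np.ix_(C, C)] - L[np.ix_(C, F)] @ np.linalg.inv(L[np.ix_(F, F)]) @ L[np.ix_(F, C)]
print(S)   # rows sum to 0, off-diagonals <= 0: again a graph Laplacian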

WHAT THIS ENABLES
[Lee-P-Spielman, in progress]: O(n) time approximate Cholesky factorization for graph Laplacians
[Lee-Sun `15]: constructible in nearly-linear work

OUTLINE
Structure preserving sampling
Sampling as a recursive 'driver'
Sampling the inaccessible
What can sampling preserve?

MORE GENERAL STRUCTURES
Non-linear structures
Directed constraints: Ax ≤ b

OTHER NORMS
q-norm: ║y║_q = (Σ_i |y_i|^q)^{1/q}
Generalization of row sampling: given A and q, find A' s.t. ║Ax║_q ≈ ║A'x║_q ∀x
1-norm: standard for representing cuts, used in sparse recovery / robust regression
Applications (for general A): feature selection, low rank approximation / PCA

L1 ROW SAMPLING
L1 Lewis weights ([Lewis `78]): w s.t. w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i (a recursive definition!)
Sampling with p_i ≥ w_i · O(log n) gives ║Ax║_1 ≈ ║A'x║_1 ∀x
Can check: Σ_i w_i ≤ n ⇒ O(n log n) rows
[Talagrand `90, "Embedding subspaces of L_1 into l^N_1"]: can be analyzed as row-sampling / sparsification
[Cohen-P `15]: iterate w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}; converges in log log n steps
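
A short numpy sketch of the [Cohen-P `15] fixed-point iteration on a random matrix (20 iterations here simply as a safe stand-in for the O(log log n) bound):

import numpy as np

def l1_lewis_weights(A, iters=20):
    """Fixed-point iteration: w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}, from w = 1."""
    m, n = A.shape
    w = np.ones(m)
    for _ in range(iters):
        Minv = np.linalg.inv(A.T @ (A / w[:, None]))   # (A^T W^{-1} A)^{-1}
        w = np.sqrt(np.einsum('ij,jk,ik->i', A, Minv, A))
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 10))
w = l1_lewis_weights(A)
print(w.sum())   # at most n, so ~ n log n sampled rows suffice for L1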

WHERE THIS FITS IN
                                #rows for q=2       #rows for q=1         Runtime
Dasgupta et al. `09                                 n^2.5                 mn^5
Magdon-Ismail `10               n log^2 n                                 mn^2
Sohler-Woodruff `11                                 n^3.5                 mn^{ω-1+θ}
Drineas et al. `12              n log n                                   mn log n
Clarkson et al. `12                                 n^4.5 log^1.5 n       mn log n
Clarkson-Woodruff `12           n^2 log n           n^8                   nnz
Mahoney-Meng `12                n^2                 n^3.5                 nnz + n^6
Nelson-Nguyen `12               n^{1+θ}                                   nnz
Li et al. `13                   n log n             n^3.66                nnz + n^{ω+θ}
Cohen et al. `14, Cohen-P `15   n log n             n log n               nnz + n^{ω+θ}
[Cohen-P `15]: elementary, optimization-motivated proof of w.h.p. concentration for L1

CONNECTION TO LEARNING THEORY
Sparsely-used dictionary learning: given Y, find A, X so that ║Y - AX║ is small and X is sparse
[Spielman-Wang-Wright `12]: L1 regression solves this using about n^2 samples
[Luh-Vu `15]: generic chaining, O(n log^4 n) samples suffice
Proof in [Cohen-P `15] gives O(n log^2 n) samples
Key: if X satisfies the Bernoulli-Subgaussian model, then ║Xy║_1 is close to its expectation for all y
'Right' bound should be O(n log n)

UNSPARSIFIABLE INSTANCE
Directed complete bipartite graph: removing any edge u → v makes v unreachable from u
Preserve less structure?

WEAKER REQUIREMENT
The sample only needs to make gains in some directions
[Cohen-Kyng-Pachocki-P-Rao `14]: point-wise convergence without matrix concentration

UNIFORM SAMPLING?
Nystrom method (on matrices): pick a random subset of the data, compute on the subset, post-process the result
Post-processing: theoretical works before us copy x over; in practice: projection, least-squares fitting
[CLMMPS `15]: half the rows as A' gives good sampling probabilities for A that sum to ≤ 2n
How powerful is (recursive) post-processing?
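
A numpy sketch of the [CLMMPS `15] observation on synthetic Gaussian data: leverage-style scores of every row of A measured against a uniform half-sample A' sum to roughly 2n, so they can serve as sampling probabilities:

import numpy as np

rng = np.random.default_rng(0)
m, n = 4000, 20
A = rng.standard_normal((m, n))

# uniformly keep half the rows as A' (no probabilities needed for this step)
half = rng.random(m) < 0.5
Ap = A[half]

# post-process: estimate each row's leverage against A' instead of A
Minv = np.linalg.pinv(Ap.T @ Ap)
est = np.minimum(1.0, np.einsum('ij,jk,ik->i', A, Minv, A))
print(est.sum(), 2 * n)   # estimates sum to about 2n, per the slide above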

WHY IS THIS EFFECTIVE?
Needle in a haystack: only d dimensions, so there cannot be too many needles, and they are easy to find via post-processing
Hay in a haystack: half the data should still contain some of the information

FUTURE WORK
What structures can sampling preserve? What does sampling need to preserve?
More concretely: more sparsification-based algorithms (e.g. multi-grid maxflow)? Sampling directed graphs? Hardness results?