Approximate computation and implicit regularization in large-scale data analysis
Michael W. Mahoney, Stanford University, Sept 2011

Motivating observation
The theory of NP-completeness is a very useful theory:
- it captures a qualitative notion of "fast," provides qualitative guidance for how algorithms perform in practice, etc.
- LP, simplex, ellipsoid, etc. - the "exception that proves the rule"
The theory of Approximation Algorithms is NOT analogously useful (at least for many machine learning and data analysis problems):
- bounds are very weak; can't get constants; dependence on parameters is not qualitatively right
- does not provide qualitative guidance w.r.t. practice
- it usually wants a vector/graph achieving the optimum, but we don't care about the particular vector/graph; etc.

Start with the conclusions
- The modern theory of approximation algorithms is often NOT a useful theory for many large-scale data analysis problems
- Approximation algorithms and heuristics often implicitly perform regularization, leading to "more robust" or "better" solutions
- We can characterize the regularization properties implicit in worst-case approximation algorithms
Take-home message: Solutions of approximation algorithms don't need to be something we "settle for," since they can be "better" than the solution to the original intractable problem

Algorithmic vs. Statistical Perspectives
Computer Scientists
- Data: are a record of everything that happened.
- Goal: process the data to find interesting patterns and associations.
- Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.
Statisticians (and Natural Scientists)
- Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.
- Goal: extract information about the world from noisy data.
- Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.
Lambert (2000)

Statistical regularization (1 of 2)
Regularization in statistics, ML, and data analysis:
- arose in integral equation theory to "solve" ill-posed problems
- computes a better or more "robust" solution, so better inference
- involves making (explicitly or implicitly) assumptions about the data
- provides a trade-off between "solution quality" and "solution niceness"
- often, heuristic approximations have regularization properties as a "side effect"

Statistical regularization (2 of 2)
Usually implemented in 2 steps:
- add a norm constraint (or "geometric capacity control function") g(x) to the objective function f(x)
- solve the modified optimization problem x' = argmin_x f(x) + λ g(x)
Often, this is a "harder" problem, e.g., L1-regularized L2-regression: x' = argmin_x ||Ax - b||_2^2 + λ ||x||_1
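To make the two-step recipe concrete, here is a minimal numpy sketch that solves the L1-regularized least-squares problem above by proximal gradient descent (ISTA). The data A, b, the value of λ, and the iteration count are arbitrary illustrative choices, not anything from the talk.

```python
import numpy as np

def ista_l1_least_squares(A, b, lam, n_iter=500):
    """Solve argmin_x ||Ax - b||_2^2 + lam * ||x||_1 by proximal gradient (ISTA)."""
    x = np.zeros(A.shape[1])
    # Step size 1/L, where L = 2*sigma_max(A)^2 is the Lipschitz constant of the smooth part.
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)                              # gradient of f(x) = ||Ax - b||^2
        z = x - step * grad                                       # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of lam*||.||_1 (soft threshold)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]   # sparse ground truth
b = A @ x_true + 0.1 * rng.normal(size=50)
print(ista_l1_least_squares(A, b, lam=1.0)[:5])
```

The point of the example is the trade-off in the slide: larger λ gives a "nicer" (sparser) solution at the expense of fit quality.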

Two main results
Big question: Can we formalize the notion that/when approximate computation can implicitly lead to "better" or "more regular" solutions than exact computation?
1. Approximate first nontrivial eigenvector of the Laplacian: three random-walk-based procedures (heat kernel, PageRank, truncated lazy random walk) are implicitly solving a regularized optimization problem exactly!
2. Spectral versus flow-based approximation algorithms for graph partitioning: theory suggests each should regularize in different ways, and empirical results agree!

Approximating the top eigenvector
Basic idea: Given a Laplacian matrix A, the power method starts with v_0 and iteratively computes v_{t+1} = A v_t / ||A v_t||_2. Expanding v_0 in the eigenvector basis, v_t = Σ_i γ_i λ_i^t v_i → v_1. If we truncate after (say) 3 or 10 iterations, we still have some mixing from the other eigen-directions.
What objective does the exact eigenvector optimize? The Rayleigh quotient R(A,x) = x^T A x / x^T x, for a vector x. But we can also express this as an SDP, for an SPSD matrix X. (We will put regularization on this SDP!)
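A minimal numpy sketch of the truncation idea: run the power method for a few iterations and compare against the exact top eigenvector. The test matrix is an arbitrary symmetric PSD matrix (not a graph matrix from the talk), which is enough to see the residual mixing from other eigen-directions.

```python
import numpy as np

def power_method(A, n_iter, v0=None, seed=0):
    """Run n_iter steps of v_{t+1} = A v_t / ||A v_t||_2 and return v_{n_iter}."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0]) if v0 is None else v0.astype(float)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# Small symmetric PSD test matrix (a stand-in for the matrix on the slide).
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
A = B @ B.T
eigvals, eigvecs = np.linalg.eigh(A)
v_top = eigvecs[:, -1]                      # exact top eigenvector

for t in (3, 10, 200):
    v_t = power_method(A, t)
    # |<v_t, v_top>| -> 1 as t grows; early truncation leaves mixing from other directions.
    print(t, abs(v_t @ v_top))
```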

Views of approximate spectral methods
Three common procedures (L = Laplacian, M = random walk matrix):
- Heat Kernel: H_t = exp(-t L)
- PageRank: R_γ = γ (I - (1-γ) M)^{-1}
- q-step Lazy Random Walk: W_α^q = (α I + (1-α) M)^q
Question: Do these "approximation procedures" exactly optimize some regularized objective?
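The operator definitions on this slide were figures that did not survive the transcript; the sketch below applies the standard forms assumed above (heat kernel exp(-tL), the PageRank resolvent γ(I - (1-γ)M)^{-1}, and the q-step lazy walk (αI + (1-α)M)^q) to a seed vector on a small path graph. The graph, seed, and parameter values are illustrative choices, not from the talk.

```python
import numpy as np
from scipy.linalg import expm

# Small undirected path graph on 5 nodes.
n = 5
Adj = np.zeros((n, n))
for i in range(n - 1):
    Adj[i, i + 1] = Adj[i + 1, i] = 1.0
deg = Adj.sum(axis=1)
M = Adj @ np.diag(1.0 / deg)         # column-stochastic random walk matrix
L = np.diag(deg) - Adj               # combinatorial graph Laplacian

s = np.zeros(n); s[0] = 1.0          # seed vector

# Heat kernel: H_t = exp(-t L)
t = 1.0
heat = expm(-t * L) @ s

# PageRank: R_gamma = gamma * (I - (1 - gamma) M)^{-1}
gamma = 0.15
pagerank = gamma * np.linalg.solve(np.eye(n) - (1 - gamma) * M, s)

# q-step lazy random walk: W_alpha^q = (alpha I + (1 - alpha) M)^q
alpha, q = 0.5, 10
lazy = np.linalg.matrix_power(alpha * np.eye(n) + (1 - alpha) * M, q) @ s

print(heat, pagerank, lazy, sep="\n")
```

All three spread the seed mass over the graph, but with different decay profiles controlled by t, γ, and (α, q) respectively.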

Two versions of spectral partitioning
VP (the vector program), SDP (its semidefinite relaxation), and their regularized variants R-VP and R-SDP; standard forms are sketched below.
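The formulations on this slide were images that did not survive the transcript; the following LaTeX sketch gives the standard forms these labels usually denote (a reconstruction, not copied from the slides), with L the graph Laplacian, f and F regularization functions on vectors and SPSD matrices respectively, and η > 0 a regularization parameter:

```latex
% Vector program (VP) and its regularized variant (R-VP):
\min_{x}\; x^{T} L x
  \quad \text{s.t. } x^{T} x = 1,\; x \perp \mathbf{1}
\qquad\qquad
\min_{x}\; x^{T} L x + \tfrac{1}{\eta} f(x)
  \quad \text{s.t. } x^{T} x = 1,\; x \perp \mathbf{1}

% SDP relaxation and its regularized variant (R-SDP), over SPSD matrices X:
\min_{X \succeq 0}\; \operatorname{Tr}(L X)
  \quad \text{s.t. } \operatorname{Tr}(X) = 1
\qquad\qquad
\min_{X \succeq 0}\; \operatorname{Tr}(L X) + \tfrac{1}{\eta} F(X)
  \quad \text{s.t. } \operatorname{Tr}(X) = 1
```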

A simple theorem (Mahoney and Orecchia, 2010)
Modify the usual SDP form of spectral partitioning to include regularization (but on the matrix X, not on the vector x).

Three simple corollaries
- F_H(X) = Tr(X log X) - Tr(X) (i.e., generalized entropy) gives the scaled Heat Kernel matrix, with t = η
- F_D(X) = -logdet(X) (i.e., Log-determinant) gives the scaled PageRank matrix, with γ ~ η
- F_p(X) = (1/p)||X||_p^p (i.e., matrix p-norm, for p > 1) gives the Truncated Lazy Random Walk, with α ~ η
Answer: These "approximation procedures" compute regularized versions of the Fiedler vector exactly!
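As a sanity check on the log-determinant corollary (a derivation sketch of my own, not taken from the slides): the first-order condition of the regularized SDP with F_D(X) = -logdet(X) forces the optimizer to be a resolvent of the Laplacian, which is exactly the PageRank form.

```latex
% Stationarity of  min_X  Tr(LX) - (1/\eta)\log\det X   s.t.  Tr(X) = 1,  X \succeq 0
% (with Lagrange multiplier \mu for the trace constraint):
L - \tfrac{1}{\eta}\, X^{-1} + \mu I = 0
\quad\Longrightarrow\quad
X^{*} = \tfrac{1}{\eta}\,(L + \mu I)^{-1},
\qquad \text{with } \mu \text{ chosen so that } \operatorname{Tr}(X^{*}) = 1 .
```

Writing L = I - M for the normalized random-walk setting, (L + μI)^{-1} = ((1+μ)I - M)^{-1}, which up to scaling equals γ(I - (1-γ)M)^{-1} with γ = μ/(1+μ), i.e., the scaled PageRank matrix of the corollary.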

Graph partitioning
A family of combinatorial optimization problems: partition a graph's nodes into two sets s.t.:
- not much edge weight crosses the cut (cut quality)
- both sides contain a lot of nodes
Several standard formulations:
- Graph bisection (minimum cut with balance)
- β-balanced bisection (minimum cut with balance β)
- cutsize/min{|A|, |B|}, or cutsize/(|A|·|B|) (expansion)
- cutsize/min{Vol(A), Vol(B)}, or cutsize/(Vol(A)·Vol(B)) (conductance or N-Cuts)
All of these formalizations of the bi-criterion are NP-hard!
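A small numpy sketch of the last two cut-quality scores in this list, expansion and conductance, evaluated for a given node subset; the graph (two triangles joined by one edge) and the cut are arbitrary illustrative choices.

```python
import numpy as np

def cut_scores(Adj, S):
    """Expansion and conductance of the cut (S, V \\ S) in an undirected graph."""
    n = Adj.shape[0]
    S = np.asarray(sorted(S))
    T = np.setdiff1d(np.arange(n), S)
    cutsize = Adj[np.ix_(S, T)].sum()                         # total edge weight crossing the cut
    deg = Adj.sum(axis=1)
    expansion = cutsize / min(len(S), len(T))                  # cutsize / min(|A|, |B|)
    conductance = cutsize / min(deg[S].sum(), deg[T].sum())    # cutsize / min(Vol(A), Vol(B))
    return expansion, conductance

# Two triangles joined by a single edge: the natural cut separates the triangles.
Adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    Adj[u, v] = Adj[v, u] = 1.0
print(cut_scores(Adj, [0, 1, 2]))   # -> (0.333..., 0.142857...)
```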

The "lay of the land"
- Spectral methods* - compute eigenvectors of associated matrices
- Local improvement - easily gets trapped in local minima, but can be used to clean up other cuts
- Multi-resolution - view (typically space-like) graphs at multiple size scales
- Flow-based methods* - single-commodity or multi-commodity versions of max-flow-min-cut ideas
*Comes with strong underlying theory to guide heuristics.

Comparison of "spectral" versus "flow"
Spectral:
- compute an eigenvector
- "quadratic" worst-case bounds
- worst case achieved on "long stringy" graphs
- worst case is a "local" property
- embeds you on a line (or in K_n)
Flow:
- solve an LP
- O(log n) worst-case bounds
- worst case achieved on expanders
- worst case is a "global" property
- embeds you in L1
The two methods have complementary strengths and weaknesses: what we compute is determined at least as much by the approximation algorithm as by the objective function.
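To make the "spectral" column concrete, here is a minimal sketch of the classical eigenvector-plus-sweep heuristic: embed the nodes on a line using the second eigenvector of the normalized Laplacian, then keep the best-conductance prefix of that ordering. The flow-based counterpart needs a max-flow or LP solver and is not sketched here; the graph is the same illustrative two-triangle example as above.

```python
import numpy as np

def spectral_sweep_cut(Adj):
    """Fiedler-vector sweep: order nodes by the 2nd eigenvector of the normalized
    Laplacian, then return the prefix set with the smallest conductance."""
    deg = Adj.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(deg)
    L_norm = np.eye(len(deg)) - (Adj * d_isqrt[:, None]) * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L_norm)
    fiedler = d_isqrt * vecs[:, 1]          # embed nodes on a line
    order = np.argsort(fiedler)
    vol_total = deg.sum()
    best_set, best_phi = None, np.inf
    for k in range(1, len(order)):          # sweep over prefixes of the ordering
        mask = np.zeros(len(deg), dtype=bool); mask[order[:k]] = True
        cutsize = Adj[np.ix_(mask, ~mask)].sum()
        vol_S = deg[mask].sum()
        phi = cutsize / min(vol_S, vol_total - vol_S)
        if phi < best_phi:
            best_set, best_phi = order[:k], phi
    return sorted(best_set.tolist()), best_phi

# Two triangles joined by one edge: the sweep recovers the obvious cut.
Adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    Adj[u, v] = Adj[v, u] = 1.0
print(spectral_sweep_cut(Adj))    # -> ([0, 1, 2], 0.142857...) or the complementary set
```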

Explicit versus implicit geometry
Explicitly-imposed geometry: traditional regularization uses an explicit norm constraint to make sure the solution vector is "small" and not too complex.
Implicitly-imposed geometry: approximation algorithms implicitly embed the data in a "nice" metric/geometric place and then round the solution.
[Figure: a map f from a metric space (X, d) to (X', d'), sending x, y with distance d(x, y) to f(x), f(y), illustrating such an embedding.]

Regularized and non-regularized communities
- Metis+MQI - a flow-based method (red) - gives sets with better conductance.
- Local Spectral (blue) gives "tighter" and more well-rounded sets.
[Plots: external/internal conductance, diameter of the cluster, and conductance of the bounding cut for Local Spectral vs. Metis+MQI clusters (connected vs. disconnected); lower is better.]

Conclusions
- The modern theory of approximation algorithms is often NOT a useful theory for many large-scale data analysis problems
- Approximation algorithms and heuristics often implicitly perform regularization, leading to "more robust" or "better" solutions
- We can characterize the regularization properties implicit in worst-case approximation algorithms
Take-home message: Solutions of approximation algorithms don't need to be something we "settle for," since they can be "better" than the solution to the original intractable problem