online convex optimization (with partial information)

Presentation transcript:

online convex optimization (with partial information). Elad Hazan @ IBM Almaden. Joint work with Jacob Abernethy and Alexander Rakhlin @ UC Berkeley.

Online routing. Edge weights (congestion) are not known in advance, and can change arbitrarily between 0 and 10 (say). (Figure: a network with source and destination nodes.)

Online routing. Iteratively: pick a path from source to destination (not knowing the network congestion), then see the length of the path. (Figure: the chosen path turns out to have length 3.)

Online routing. Iteratively: pick a path from source to destination (not knowing the network congestion), then see the length of the path. (Figure: this time the chosen path has length 5.)

Online routing. Iteratively: pick a path from source to destination (not knowing the network congestion), then see the length of the path. This talk: an efficient and optimal algorithm for online routing.

Why is the problem hard? Three sources of difficulty: (1) prediction, the future is unknown and unrestricted; (2) partial information, the performance of other decisions is never observed; (3) efficiency, the decision set is exponentially large. Similar situations arise in production, web retail, advertising, online routing…

Technical agenda: describe the classical general framework; where classical approaches come up short; the modern online convex optimization framework; our new algorithm and a few technical details.

Multi-armed bandit problem. Iteratively, for time t = 1, 2, …: the adversary chooses payoffs r_t on the machines; the player picks a machine i_t; only the loss r_t(i_t) is revealed. The process is repeated for T rounds, with different, unknown, arbitrary losses every round. Goal: minimize the regret, Regret = cost(ALG) - cost(smallest-cost fixed machine) = Σ_t r_t(i_t) - min_j Σ_t r_t(j). This is an online learning problem: design efficient algorithms with small regret. (Figure: slot machines with example payoffs 1, 5, -3, 5.)
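
A minimal sketch (not from the slides) of this regret bookkeeping; the loss matrix `losses` and choice sequence `choices` are hypothetical inputs:

```python
import numpy as np

def regret(losses, choices):
    """Regret of a sequence of arm choices against the best fixed arm in hindsight.

    losses:  (T, K) array, losses[t, j] = loss of machine j at round t
    choices: length-T sequence of arms picked by the player
    """
    T = len(choices)
    alg_cost = sum(losses[t, choices[t]] for t in range(T))
    best_fixed_cost = losses.sum(axis=0).min()   # best single machine in hindsight
    return alg_cost - best_fixed_cost

# toy example: 2 machines, 3 rounds
losses = np.array([[1.0, 5.0], [-3.0, 5.0], [1.0, 0.0]])
print(regret(losses, [0, 1, 1]))
```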

Multi-armed bandit problem. Studied in game theory and statistics (as early as the 1950s), and more recently in machine learning: Robbins (1952), statistical setting; Hannan (1957), game-theoretic setting with full information; [ACFS '98], adversarial bandits.

Multi-armed bandit problem. [ACFS '98], adversarial bandits: Regret = O(√T · √K) for K machines and T game iterations, optimal up to the constant. What's left to do? Applied to routing this gives exponential regret and running time, since K = number of paths in the graph. Our result: an efficient algorithm with √T · n regret, where n = number of edges in the graph.
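
For context, a minimal EXP3-style sketch (not the slides' algorithm; the loss oracle `loss(t, arm)` is a hypothetical stand-in for the adversary) showing the importance-weighted exponential-weights idea behind the [ACFS '98] bound:

```python
import numpy as np

def exp3(T, K, loss, eta=0.1):
    """EXP3-style bandit algorithm: exponential weights on importance-weighted
    loss estimates. `loss(t, arm)` returns the loss in [0, 1] of the pulled arm."""
    weights = np.ones(K)
    total_loss = 0.0
    rng = np.random.default_rng(0)
    for t in range(T):
        probs = weights / weights.sum()
        arm = rng.choice(K, p=probs)
        observed = loss(t, arm)               # only this entry of the loss vector is revealed
        total_loss += observed
        est = np.zeros(K)
        est[arm] = observed / probs[arm]      # unbiased estimate of the full loss vector
        weights *= np.exp(-eta * est)         # multiplicative-weights update
        weights /= weights.max()              # keep weights in a numerically safe range
    return total_loss

# toy adversary: arm 0 is slightly better on average
print(exp3(T=1000, K=4, loss=lambda t, a: 0.4 if a == 0 else 0.6))
```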

Online convex optimization. A distribution over paths = a flow in the graph. The set of all flows is a polytope P ⊆ R^n. This polytope has a compact representation, since the number of facets, i.e. the number of constraints for flows in graphs (flow conservation and edge capacity), is small. (Figure: the online game, where at each round t the player picks x_t and then pays f_t(x_t).) At this stage you should be appalled… The cost functions are convex and bounded (for this talk, linear functions). Total loss = Σ_t f_t(x_t). So online routing can be expressed as OCO over a polytope in R^n with O(n) constraints (despite the exponentially large decision set).
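
A minimal sketch (hypothetical toy graph and cost vector, not from the slides) of this compact representation: flow conservation plus capacity bounds give O(n) linear constraints, here used to find the cheapest unit s-t flow for one fixed linear cost via scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical toy graph: nodes 0 (source) .. 3 (destination), one variable per edge
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
n_nodes, n_edges = 4, len(edges)

# node-arc incidence matrix: row v, column e is +1 if e leaves v, -1 if e enters v
A = np.zeros((n_nodes, n_edges))
for e, (u, v) in enumerate(edges):
    A[u, e] += 1.0
    A[v, e] -= 1.0
b = np.array([1.0, 0.0, 0.0, -1.0])          # ship one unit from node 0 to node 3

cost = np.array([3.0, 5.0, 2.0, 1.0, 0.5])   # current edge lengths (losses)
res = linprog(cost, A_eq=A, b_eq=b, bounds=[(0, 1)] * n_edges)
print(res.x)   # a point of the flow polytope minimizing this linear cost
```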

Online performance metric: regret. Let x* be the minimizer of Σ_t f_t(x), the fixed optimum in hindsight. Loss of our algorithm = Σ_t f_t(x_t). Regret = Loss(ALG) - Loss(OPT point) = Σ_t f_t(x_t) - Σ_t f_t(x*).

Existing techniques: AdaBoost, Online Gradient Descent, Weighted Majority, Online Newton Step, Perceptron, Winnow… All can be seen as special cases of "Follow the Regularized Leader" (FTRL): at time t, predict x_t = arg min_{x ∈ K} [ Σ_{s<t} f_s(x) + (1/η) R(x) ] for a regularizer R. This requires complete knowledge of the cost functions.
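
A minimal sketch (not from the slides) of FTRL in the full-information setting, over the probability simplex with the entropic regularizer, for which the regularized leader has the closed form of exponential weights on cumulative losses:

```python
import numpy as np

def ftrl_entropic(loss_vectors, eta=0.5):
    """Follow the Regularized Leader over the simplex with R(x) = sum_i x_i log x_i.
    For linear losses, arg min_x [ <sum_{s<t} f_s, x> + R(x)/eta ] is the softmax
    of -eta * (cumulative loss), i.e. exponential weights."""
    K = len(loss_vectors[0])
    cumulative = np.zeros(K)
    total = 0.0
    for f in loss_vectors:
        w = np.exp(-eta * cumulative)
        x = w / w.sum()            # the regularized leader at round t
        total += float(x @ f)      # full information: the whole vector f_t is seen
        cumulative += f
    return total

rng = np.random.default_rng(0)
fs = rng.uniform(0, 1, size=(100, 4))
print(ftrl_entropic(fs))
```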

Our algorithm, template. Two generic parts: compute the "regularized leader", but over estimated cost functions; and, to make this work, construct estimates g_t such that E[g_t] = f_t. Note: we would prefer to run the regularized leader on the true functions f_t, but we do not have access to f_t!

Our algorithm, template. Compute the current prediction x_t based on previous observations. Play a random variable y_t, drawn from a distribution such that: the distribution is centered at the prediction, E[y_t] = x_t; from the information f_t(y_t) we can create a random variable g_t such that E[g_t] = f_t; and the variance of the random variable g_t is small.

Simple example: interpolating the cost function. We want to compute and use f_t, but what is f_t? We only see f_t(x_t). The convex set is the 2-dimensional simplex (distributions on two endpoints).

Simple example: interpolating the cost function. We want to compute and use f_t, but we only see f_t(x_t). On the 2-dimensional simplex (distributions on two endpoints), the expected cost of playing x = (x_1, x_2) is x_1 · f_t(1) + x_2 · f_t(2). (Figure: the two endpoints, with costs f_t(1), f_t(2) and probabilities x_1, x_2.)

Simple example: interpolating the cost function. An unbiased estimate of f_t is just as good, since the expected value remains x_1 · f_t(1) + x_2 · f_t(2): with probability x_1, play endpoint 1 and set g_t = (f_t(1)/x_1, 0); with probability x_2, play endpoint 2 and set g_t = (0, f_t(2)/x_2). Note that E[g_t] = f_t.
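
A minimal numerical sketch (arbitrary toy values) of this importance-weighted estimator, checking empirically that E[g_t] ≈ f_t:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([2.0, -1.0])          # the true (hidden) cost vector f_t on the two endpoints
x = np.array([0.3, 0.7])           # the current prediction, a distribution over endpoints

samples = []
for _ in range(100_000):
    i = rng.choice(2, p=x)         # play a random endpoint, so E[y_t] = x
    g = np.zeros(2)
    g[i] = f[i] / x[i]             # importance-weighted estimate from the single observation
    samples.append(g)

print(np.mean(samples, axis=0))    # ≈ [2.0, -1.0], i.e. E[g_t] = f_t
```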

The algorithm, complete definition. Let R(x) be a self-concordant barrier for the convex set. Compute the current prediction center x_t (the regularized leader over the estimates so far). Compute the eigenvectors of the Hessian of R at the current point, and consider the points where the eigendirections intersect the Dikin ellipsoid. Sample y_t uniformly from these eigendirections to obtain unbiased estimates with E[g_t] = f_t. Remember that we need the random variables y_t, g_t to satisfy: E[y_t] = x_t, E[g_t] = f_t, and the variance of g_t needs to be small. This is where self-concordance is crucial.
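
A sketch of one sampling/estimation round, under the assumption of a linear loss vector (all names hypothetical; in the actual algorithm the Hessian is that of the barrier at the current FTRL iterate):

```python
import numpy as np

def sample_and_estimate(x, hessian, loss_vector, rng):
    """One bandit round: play a point on the boundary of the Dikin ellipsoid at x
    along a random eigendirection of the barrier Hessian, observe only the scalar
    loss, and return an unbiased estimate of the (linear) loss vector."""
    n = len(x)
    eigvals, eigvecs = np.linalg.eigh(hessian)    # Hessian of the barrier at x
    i = rng.integers(n)                           # uniform eigendirection
    eps = rng.choice([-1.0, 1.0])                 # uniform sign
    v = eigvecs[:, i] / np.sqrt(eigvals[i])       # Dikin-ellipsoid radius in that direction
    y = x + eps * v                               # the point actually played; E[y] = x
    observed = float(loss_vector @ y)             # the only feedback we get
    g = n * observed * eps * np.sqrt(eigvals[i]) * eigvecs[:, i]   # E[g] = loss_vector
    return y, g

# quick unbiasedness check on a toy 3-d example with a diagonal Hessian
rng = np.random.default_rng(0)
x, H, f = np.array([0.5, 0.3, 0.2]), np.diag([50.0, 80.0, 120.0]), np.array([1.0, -2.0, 0.5])
print(np.mean([sample_and_estimate(x, H, f, rng)[1] for _ in range(50_000)], axis=0))
```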

An extremely short survey: interior-point (IP) methods. Probably the most widely used algorithms today; see "Interior-Point Polynomial Algorithms in Convex Programming" [NN '94]. To solve a general optimization problem, i.e. minimize a convex function over a convex set specified by constraints, reduce it to unconstrained optimization via a self-concordant barrier function: the constrained problem min f(x) subject to A_1 · x - b_1 ≤ 0, …, A_m · x - b_m ≤ 0, x ∈ R^n, becomes the unconstrained problem min_{x ∈ R^n} f(x) - Σ_i log(b_i - A_i · x), where R(x) = -Σ_i log(b_i - A_i · x) is the barrier function.
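
A minimal sketch (not from the slides) of this log barrier for linear constraints A x ≤ b, with its gradient and Hessian, the two quantities the rest of the talk relies on:

```python
import numpy as np

def log_barrier(A, b, x):
    """R(x) = -sum_i log(b_i - A_i x), its gradient and Hessian, for the polytope Ax <= b."""
    s = b - A @ x                        # slacks; must be strictly positive
    if np.any(s <= 0):
        raise ValueError("x is not strictly inside the polytope")
    value = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)
    hess = A.T @ np.diag(1.0 / s**2) @ A
    return value, grad, hess

# toy example: the box 0 <= x <= 1 in R^2 written as Ax <= b
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
print(log_barrier(A, b, np.array([0.25, 0.5])))
```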

The geometric properties of self-concordant functions. Let R be self-concordant for a convex set K. At each point x, the Hessian of R at x defines a local norm, ||h||_x = sqrt(hᵀ ∇²R(x) h), and the Dikin ellipsoid D_R(x) = { y : ||y - x||_x ≤ 1 }. Fact: for all x in the set, D_R(x) ⊆ K. The Dikin ellipsoid characterizes the "space to the boundary" in every direction, and it is tight.
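
A minimal numerical check of the containment D_R(x) ⊆ K; the box [0,1]^n and its standard log barrier are chosen here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.uniform(0.05, 0.95, size=n)           # interior point of the box [0,1]^n

# Hessian of the log barrier R(x) = -sum log(x_i) - sum log(1 - x_i): diagonal
H = np.diag(1.0 / x**2 + 1.0 / (1.0 - x)**2)

# sample points on the boundary of the Dikin ellipsoid {y : (y-x)^T H (y-x) <= 1}
inside = True
for _ in range(1000):
    v = rng.normal(size=n)
    v /= np.sqrt(v @ H @ v)                   # scale so that ||v||_x = 1
    y = x + v
    inside &= bool(np.all((y >= 0) & (y <= 1)))
print(inside)   # True: the Dikin ellipsoid stays inside the set
```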

Unbiased estimators from self-concordant functions. The eigendirections of the Hessian give an unbiased estimator in each direction; since they form an orthogonal basis, together they yield a complete estimator. Self-concordance is important for two reasons: it captures the geometry of the set (the Dikin ellipsoid), and it controls the variance of the estimator.

Online routing, the final algorithm. Maintain a flow x_t; sample a nearby flow y_t; compute a flow decomposition of y_t to obtain a path p_t; guess (play) that path and observe its length; update to x_{t+1}. (Figure: x_t, y_t and x_{t+1} inside the flow polytope, with an observed path length of 3.)
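
A minimal sketch (hypothetical toy graph) of the flow-decomposition step, which turns the fractional point y_t into a random path p_t whose expectation matches the flow:

```python
import numpy as np

def flow_decomposition(edges, flow, source, sink, tol=1e-9):
    """Decompose a unit s-t flow (one value per edge) into weighted paths:
    repeatedly follow positive-flow edges from source to sink, peel off the
    bottleneck amount, and record (path, weight)."""
    flow = dict(zip(edges, flow))
    paths = []
    while True:
        path, node = [], source
        while node != sink:
            nxt = [(u, v) for (u, v) in edges if u == node and flow[(u, v)] > tol]
            if not nxt:
                return paths              # no residual flow left to decompose
            path.append(nxt[0])
            node = nxt[0][1]
        bottleneck = min(flow[e] for e in path)
        for e in path:
            flow[e] -= bottleneck
        paths.append((path, bottleneck))

# toy example: the unit flow splits 0.6 / 0.4 over two parallel routes
edges = [(0, 1), (1, 3), (0, 2), (2, 3)]
paths = flow_decomposition(edges, [0.6, 0.6, 0.4, 0.4], source=0, sink=3)
weights = np.array([w for _, w in paths])
picked = paths[np.random.default_rng(0).choice(len(paths), p=weights / weights.sum())]
print(picked)      # the random path p_t played by the algorithm
```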

Summary. An online optimization algorithm with limited feedback: optimal regret, O(√T · n); efficient, n³ time per iteration; and a more efficient implementation based on interior-point theory. Open questions: the dependence on the dimension is not known to be optimal; the algorithm has large variance, giving deviations of order T^{2/3}, can this be reduced to √T?; adaptive adversaries. The End