Published by Hector Jordan. Modified over 9 years ago.
1
Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback
Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann
2
Nonstochastic sequential decision-making
$K$ actions and $T$ time steps
$\ell_t(a)$ – loss of action $a$ at time $t$
At time $t$:
– player picks action $X_t$
– incurs loss $\ell_t(X_t)$
– observes feedback on the losses
Multi-armed bandit: only $\ell_t(X_t)$
Experts (full information): $\ell_t(j)$ for every $j$
3
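Nonstochastic sequential decision-making
Goal:
– minimize losses
– benchmark: the best single action, i.e. the action $j$ that minimizes the cumulative loss $\sum_{t=1}^{T} \ell_t(j)$
– no stochastic assumptions on the losses
Regret: $\mathbb{E}\big[\sum_{t=1}^{T} \ell_t(X_t)\big] - \min_j \sum_{t=1}^{T} \ell_t(j)$
Known regret bounds:
– MAB: $\tilde{O}(\sqrt{KT})$
– Experts: $O(\sqrt{T \ln K})$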
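The regret against the best single action can be illustrated with a small sketch. The function name `regret` and the toy loss table are assumptions made for illustration; the code computes the realized regret of one sequence of plays, whereas the slide's quantity also averages over the player's randomness.

```python
def regret(losses, plays):
    """Cumulative loss of the player minus that of the best fixed action.

    losses[t][a] is the loss of action a at step t;
    plays[t] is the action the player picked at step t.
    """
    player_loss = sum(losses[t][a] for t, a in enumerate(plays))
    num_actions = len(losses[0])
    best_fixed = min(sum(row[a] for row in losses) for a in range(num_actions))
    return player_loss - best_fixed


# Toy instance with K = 3 actions and T = 3 steps (made-up numbers).
losses = [[0.0, 1.0, 0.5],
          [1.0, 0.0, 0.5],
          [0.0, 1.0, 0.5]]
plays = [1, 0, 1]
example_regret = regret(losses, plays)
print(example_regret)
```

The benchmark is a single action fixed for all rounds, so a player who switches actions can still be compared against it without any assumption on how the losses were generated.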
4
Motivation – observability
[Figure: undirected and directed observation graphs]
5
Undirected observation graph
[Figure: graph on eight actions, all losses unknown ("?")]
6
[Figure: same graph with one loss revealed (3); the other seven are still "?"]
7
[Figure: same graph with the losses 5, 3, 1 and 7 revealed; the other four are still "?"]
8
MAB: no edges
Experts: clique
[Figure: with no edges only one loss (3) is revealed; with a clique all eight losses (5, 3, 6, 1, 4, 7, 8, 2) are revealed]
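The two extreme observation graphs can be written down directly. This is a minimal sketch: the adjacency-dictionary representation and the helper names `observed_set`, `empty_graph` and `clique` are assumptions chosen for illustration.

```python
def observed_set(graph, action):
    """Actions whose losses are revealed when `action` is played.

    `graph` maps each action to the set of its neighbors; the played
    action always observes its own loss.
    """
    return {action} | set(graph.get(action, ()))


def empty_graph(k):
    # Bandit feedback: no edges, so only the played action is observed.
    return {a: set() for a in range(k)}


def clique(k):
    # Expert (full-information) feedback: every action observes all others.
    return {a: set(range(k)) - {a} for a in range(k)}


K = 8
bandit_view = observed_set(empty_graph(K), 2)
experts_view = observed_set(clique(K), 2)
print(sorted(bandit_view), sorted(experts_view))
```

Any graph between these two extremes gives a feedback model that is richer than bandit feedback and poorer than full information.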
9
Modeling
Directed vs. undirected
– different types of dependencies
– different measures: independent set, dominating set, max acyclic subgraph
Informed vs. uninformed: when does the learner observe the graph?
– before acting (informed)
– after acting, and only the neighbors (uninformed)
10
Our Results
Uninformed setting
– only the neighbors of the node are observed
– undirected graph: independent sets
– directed graph: max acyclic subgraph (not tight); random Erdős-Rényi graphs
Informed setting
– directed graphs
– regret characterization: dominating sets and independent sets
– both in expectation and with high probability
11
EXP3-SET
Online algorithm [equations not captured in transcript]
Theorem [equation not captured in transcript]
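Since the slide's formulas did not survive, the following is only a sketch of an exponential-weights learner with graph feedback in the spirit of EXP3-SET. It assumes the usual construction: play $i$ with probability $p_i = w_i / W$, and for every observed action $i$ use the importance-weighted estimate $\ell_t(i)/q_i$, where $q_i$ is the probability that the loss of $i$ is observed. The function name, argument layout, ring graph and loss values are all illustrative assumptions.

```python
import math
import random


def exp3_set(loss_rows, graphs, eta, rng):
    """Sketch of an Exp3-SET-style learner.

    loss_rows[t][a] : loss in [0, 1] of action a at step t
    graphs[t][a]    : set of actions observed when a is played at step t
                      (the played action is always observed as well)
    Returns the list of played actions.
    """
    k = len(loss_rows[0])
    weights = [1.0] * k
    plays = []
    for losses, graph in zip(loss_rows, graphs):
        total = sum(weights)
        p = [w / total for w in weights]
        played = rng.choices(range(k), weights=p)[0]
        plays.append(played)
        for i in {played} | set(graph[played]):
            # q_i: probability that the loss of action i is observed this round.
            q_i = sum(p[j] for j in range(k) if j == i or i in graph[j])
            # Exponential update with the importance-weighted loss estimate.
            weights[i] *= math.exp(-eta * losses[i] / q_i)
        # Rescale to avoid numerical underflow; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return plays


# Hypothetical instance: 4 actions on a ring, action 0 is clearly the best.
K, T = 4, 2000
ring = {a: {(a - 1) % K, (a + 1) % K} for a in range(K)}
loss_rows = [[0.1, 0.9, 0.9, 0.9] for _ in range(T)]
plays = exp3_set(loss_rows, [ring] * T, eta=0.05, rng=random.Random(0))
print(plays.count(0) / T)
```

Only `graph[played]` is read when forming the observed set, but computing $q_i$ in this sketch looks at the whole graph of the round; a version for the uninformed setting would need the graph to be disclosed after the play.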
12
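EXP3-SET regret – key lemma
Lemma: for an undirected graph, $Q \le \alpha(G)$
Note: MAB: $Q = K$; full information: $Q = 1$
Proof: build an independent set $S$
– consider the action $a$ with minimal $\Pr[a \text{ observed}]$
– add $a$ to $S$
– delete $a$ and its neighbors
– repeat until no actions remain
Note [equation not captured in transcript]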
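The greedy construction in the proof can be checked numerically. The sketch below assumes that $Q$ denotes $\sum_i p_i / q_i$, where $q_i = p_i + \sum_{j \in N(i)} p_j$ is the probability that action $i$ is observed; this definition is not in the transcript and is taken as an assumption. The path graph and the distribution are made-up examples.

```python
def observation_probs(p, neighbors):
    """q[i] = p[i] plus the probabilities of the neighbors of i (undirected graph)."""
    return [p[i] + sum(p[j] for j in neighbors[i]) for i in range(len(p))]


def greedy_independent_set(p, neighbors):
    """Repeatedly take the remaining action least likely to be observed,
    then delete it together with its neighbors."""
    q = observation_probs(p, neighbors)
    remaining = set(range(len(p)))
    chosen = []
    while remaining:
        a = min(remaining, key=lambda i: q[i])
        chosen.append(a)
        remaining -= {a} | set(neighbors[a])
    return chosen


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4 with a skewed distribution.
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
p = [0.4, 0.1, 0.2, 0.2, 0.1]
q = observation_probs(p, neighbors)
Q = sum(p_i / q_i for p_i, q_i in zip(p, q))
S = greedy_independent_set(p, neighbors)
print(Q, S)
```

Each greedy step removes a group of actions whose terms $p_j / q_j$ sum to at most 1, because every removed $j$ has $q_j$ at least as large as that of the chosen action. Summing over steps gives $Q \le |S| \le \alpha(G)$.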
13
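EXP3-SET directed case
Directed graph: the lemma does not hold
Example:
– tournament graph: $j \leftarrow i$ iff $j < i$
– probabilities $p_i = 2^{-i}$
– $\alpha(G) = 1$
Random graph:
– Erdős-Rényi with edge parameter $r$
– Regret [equation not captured in transcript]
– MAB: $r = 0$; Experts: $r = 1$
– Note [equation not captured in transcript]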
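Both examples on this slide can be explored numerically. The sketch assumes that $Q = \sum_i p_i / q_i$ with $q_i$ the probability that $i$ is observed, and that the tournament is oriented so that playing $i$ reveals every $j < i$ (the arrow is missing from the transcript, so this orientation is an assumption). For the random graph, the code computes the exact expectation of $Q$ under the uniform distribution only; the closed form $(1 - (1-r)^K)/r$ used for comparison follows from the binomial in-degree and is not taken from the slide.

```python
import math


def q_value(p, observers):
    """Q = sum_i p[i] / q[i]; observers[i] holds the actions whose play reveals i."""
    return sum(p[i] / (p[i] + sum(p[j] for j in observers[i]))
               for i in range(len(p)))


K = 10
# Transitive tournament: action i is revealed by every higher-indexed action.
observers = {i: set(range(i + 1, K)) for i in range(K)}
# p_i proportional to 2^{-i} (1-based), renormalized to sum to one.
raw = [2.0 ** -(i + 1) for i in range(K)]
p = [x / sum(raw) for x in raw]
Q_tournament = q_value(p, observers)
print(Q_tournament)  # grows linearly in K although alpha(G) = 1


def expected_q_uniform(k, r):
    """Exact E[Q] for uniform p when every arc is present independently w.p. r.

    Under uniform p, Q = sum_i 1 / (1 + indegree_i), and each in-degree is
    Binomial(k - 1, r).
    """
    return k * sum(math.comb(k - 1, d) * r ** d * (1 - r) ** (k - 1 - d) / (1 + d)
                   for d in range(k))


print(expected_q_uniform(K, 0.3), (1 - 0.7 ** K) / 0.3)
```

In the tournament every term $p_i / q_i$ is at least $1/2$, so $Q$ exceeds $K/2$ even though any two actions are adjacent. In the random-graph computation the value moves from $K$ at $r = 0$ to $1$ at $r = 1$, matching the MAB and Experts endpoints named on the slide.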
14
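EXP3-SET directed case
Upper bound (directed)
– mas(G) = maximum acyclic subgraph of G
– tournament: mas(G) = K and α(G) = 1
– Regret [equation not captured in transcript]
Lower bound (directed)
– any fixed graph G
– Regret [equation not captured in transcript], even if the learner knows the graph in advance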
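The gap between mas(G) and α(G) on a tournament can be verified by brute force on a small instance. The sketch assumes that mas(G) is the largest number of vertices inducing an acyclic subgraph and that α(G) ignores arc orientation; the helper names are illustrative and the search is exponential, so it only suits tiny graphs.

```python
from itertools import combinations


def is_acyclic(nodes, arcs):
    """Check that the subgraph induced by `nodes` has no directed cycle."""
    nodes = set(nodes)
    indeg = {v: 0 for v in nodes}
    for u, v in arcs:
        if u in nodes and v in nodes:
            indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while stack:  # repeatedly peel off nodes with no incoming arcs
        u = stack.pop()
        removed += 1
        for a, b in arcs:
            if a == u and b in nodes:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return removed == len(nodes)


def largest_subset(k, predicate):
    """Size of the largest vertex subset satisfying `predicate` (brute force)."""
    for size in range(k, 0, -1):
        if any(predicate(s) for s in combinations(range(k), size)):
            return size
    return 0


def mas(k, arcs):
    return largest_subset(k, lambda s: is_acyclic(s, arcs))


def alpha(k, arcs):
    adjacent = {frozenset(arc) for arc in arcs}
    return largest_subset(
        k, lambda s: all(frozenset(pair) not in adjacent
                         for pair in combinations(s, 2)))


K = 6
tournament = [(i, j) for i in range(K) for j in range(i)]  # i -> j iff j < i
cycle3 = [(0, 1), (1, 2), (2, 0)]
print(mas(K, tournament), alpha(K, tournament), mas(3, cycle3), alpha(3, cycle3))
```

A transitive tournament is itself acyclic, so all $K$ vertices qualify for mas(G) while no two vertices are independent; a directed triangle shows that the two quantities can also differ by less.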
15
Dominating set – directed graph
[Figure: directed observation graph on eight actions, all losses unknown ("?")]
17
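EXP3-DOM
Simplified version
– fixed graph $G$
– $D$ is a dominating set (log approximation)
Main modification
– add probability mass to $D$ to induce observability
Each round:
– probabilities: [equation not captured in transcript]
– select $X_t$ using $p_t$
– observe $\ell_t(a)$ for $a \in S_{X_t,t}$
– weights: [equation not captured in transcript]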
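The two ingredients of the simplified version can be sketched as follows. The greedy set-cover routine is one standard way to obtain a logarithmic approximation of a minimum dominating set, and the mixing rule $p_i = (1-\gamma)\, w_i / W + (\gamma / |D|)\,\mathbb{1}[i \in D]$ is an assumed form of the probabilities, since the slide's formula was not captured. Function names and the example graph are illustrative.

```python
def greedy_dominating_set(out_neighbors):
    """Greedy set cover: a log-factor approximation of a minimum dominating set.

    out_neighbors[i] is the set of actions observed when i is played.
    """
    k = len(out_neighbors)
    covers = {i: {i} | set(out_neighbors[i]) for i in range(k)}
    uncovered = set(range(k))
    dominating = []
    while uncovered:
        # Pick the action that reveals the most still-uncovered actions.
        best = max(range(k), key=lambda i: len(covers[i] & uncovered))
        dominating.append(best)
        uncovered -= covers[best]
    return dominating


def exp3_dom_probabilities(weights, dominating, gamma):
    """Mix exponential weights with uniform exploration over the dominating set."""
    total = sum(weights)
    bonus = gamma / len(dominating)
    members = set(dominating)
    return [(1 - gamma) * w / total + (bonus if i in members else 0.0)
            for i, w in enumerate(weights)]


# Hypothetical graph: action 0 observes everyone, action i observes every j > i.
K = 5
out_neighbors = {i: set(range(i + 1, K)) for i in range(K)}
D = greedy_dominating_set(out_neighbors)
p = exp3_dom_probabilities([1.0] * K, D, gamma=0.2)
print(D, p)
```

Placing the exploration mass on $D$ only keeps every action observable with probability at least $\gamma / |D|$, instead of the $\gamma / K$ that uniform exploration over all actions would give.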
18
EXP3-DOM: simple example
Transitive observability
– tournament: action 1 observes all actions
– $D = \{1\}$
EXP3-DOM
– sample action 1 with probability $\gamma$ (action 1 is the exploration)
– otherwise run a MAB algorithm, specifically EXP3-SET
Intuition
– action 1 replaces the mixture with the uniform distribution
19
Conclusion
Observability model
– between MAB and Experts
– more work to be done
Uninformed setting: undirected graph
Informed setting: directed graph
[Kocak, Neu, Valko and R. Munos] improved the uninformed setting
21
EXP3-DOM – main theorem
Theorem: [equation not captured in transcript]
– tuning $\gamma$
Corollary: [equation not captured in transcript]
23
Outline
– Model and motivation
– Symmetric observability
– Non-symmetric observability
24
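EXP3-DOM: key lemma
Lemma [equation not captured in transcript]
– $G$ directed graph
– $d_i^-$: indegree of $i$
– $\alpha = \alpha(G)$
Turán's theorem [equation not captured in transcript]
– undirected graph $G(V, E)$
Proof (high level)
– shrink the graph: $G_K, G_{K-1}, \dots$
– at step $s$, delete the node of maximum indegree
– bound its indegree from Turán's theorem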
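The statement of the lemma is missing from the transcript. The sketch below assumes it is the bound $\sum_i 1/(1 + d_i^-) \le 2\alpha \ln(1 + K/\alpha)$ and checks that inequality on small random directed graphs, with $\alpha$ computed by brute force while ignoring arc orientation. The random-graph parameters and helper names are illustrative assumptions.

```python
import math
import random
from itertools import combinations


def independence_number(k, arcs):
    """Brute-force alpha(G), ignoring arc orientation (small k only)."""
    adjacent = {frozenset(arc) for arc in arcs}
    for size in range(k, 0, -1):
        for subset in combinations(range(k), size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0


def lemma_sides(k, arcs):
    """Return (sum_i 1/(1 + indegree_i), 2 * alpha * ln(1 + k / alpha))."""
    indegree = [0] * k
    for _, head in arcs:
        indegree[head] += 1
    lhs = sum(1.0 / (1 + d) for d in indegree)
    a = independence_number(k, arcs)
    rhs = 2 * a * math.log(1 + k / a)
    return lhs, rhs


rng = random.Random(0)
K = 7
checks = []
for _ in range(20):
    # Each ordered pair (i, j) becomes an arc independently with probability 0.3.
    arcs = [(i, j) for i in range(K) for j in range(K)
            if i != j and rng.random() < 0.3]
    checks.append(lemma_sides(K, arcs))
print(all(lhs <= rhs for lhs, rhs in checks))
```

The role of Turán's theorem in the proof outline is to guarantee that a graph with small independence number has many edges, hence some node of large indegree whose removal costs little in the sum.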
25
EXP3-DOM: key lemma (proof)
Completing the proof [equations not captured in transcript]
Note, due to edge elimination [equation not captured in transcript]
26
EXP3-DOM: key lemma (modified)
Lemma (what we really need!)
– $G(V, E)$ directed graph
– $\mathrm{IN}_i$: indegree of $i$
– $r$: size of the dominating set; $\alpha$: size of the independent set
– $p$: distribution over $V$ with $p_i \ge \beta$
27
EXP3-DOM: changing graphs
Simple case
– all dominating sets have the same size
– or approximately the same size
Problem
– dominating sets of different sizes: the size can be 1 or $K$
Solution
– keep log levels, depending on $\log_2 |D_t|$
– one algorithm per level
Complications
– parameters depend on the level
– setting the learning rate needs a delicate doubling
Main technical challenge
– handling a dynamic adversary
28
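EXP3-DOM
Receive the observation graph
– find a dominating set $D_t$ (logarithmic approximation)
Run the right copy
– let $b_t = \log_2 |D_t|$
– run copy $b_t$ (log copies in total)
For copy $b_t$
– parameters depend on $b_t$
– probabilities: [equation not captured in transcript]
– select $X_t$ using $p$
– observe $\ell_t(a)$ for $a \in S_{X_t,t}$
– weights: [equation not captured in transcript]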
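Routing each round to a copy can be sketched in a few lines. The code assumes that $b_t$ is rounded down, i.e. $b_t = \lfloor \log_2 |D_t| \rfloor$, which the transcript does not state; the function name and the example dominating sets are made up.

```python
def copy_index(dominating_set):
    """Level b_t = floor(log2 |D_t|) of the copy that handles this round."""
    size = len(dominating_set)
    if size < 1:
        raise ValueError("a dominating set cannot be empty")
    # For positive integers, bit_length() - 1 equals floor(log2(size)) exactly.
    return size.bit_length() - 1


K = 16
num_copies = K.bit_length()  # levels 0 .. floor(log2 K)
rounds_per_copy = [0] * num_copies

# Hypothetical dominating sets found in four consecutive rounds.
for dom in ([0], [0, 3], [1, 2, 5], list(range(16))):
    rounds_per_copy[copy_index(dom)] += 1
print(rounds_per_copy)
```

Grouping rounds this way means that, within one copy, dominating-set sizes differ by less than a factor of two, so a single choice of parameters per copy is reasonable.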
29
EXP3-DOM – main theorem
Theorem: [equation not captured in transcript]
– tuning $\gamma_b$
30
Independent set
Independent set α(G) [Mannor & Shamir 2012]
Tight regret
– α(G) "replaces" K
Cons:
– requires observing G
– solves an LP at each step
[Figure: observation graph on eight actions, all losses unknown ("?")]