Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback
Noga Alon (TAU), Nicolò Cesa-Bianchi (Milan), Claudio Gentile (Insubria), Shie Mannor (Technion), Yishay Mansour (TAU and MSR), Ohad Shamir (Weizmann)

Nonstochastic sequential decision-making
K actions and T time steps
l_t(a) – loss of action a at time t
At time t, the player
– picks action X_t
– incurs loss l_t(X_t)
– observes feedback on losses
Multi-armed bandit (MAB): observes only l_t(X_t)
Experts (full information): observes l_t(j) for every j

Nonstochastic sequential decision-making
Goal:
– minimize losses
– benchmark: the best single action, i.e. the action j that minimizes the total loss
– no stochastic assumptions on losses
Regret
Known regret bounds:
– MAB
– Experts
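
The formulas on this slide did not survive in the transcript. As background, the regret against the best single action is usually written as below, and the bound orders shown are the standard ones for EXP3 (MAB) and exponential weights (Experts). They are given here as well-known reference results, not as a reconstruction of the slide.

```latex
R_T \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} l_t(X_t)\Big] \;-\; \min_{j \in \{1,\dots,K\}} \sum_{t=1}^{T} l_t(j)
\qquad
\text{MAB: } R_T = O\big(\sqrt{T K \ln K}\big),
\quad
\text{Experts: } R_T = O\big(\sqrt{T \ln K}\big)
```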

Motivation – observability: undirected and directed

Undirected observation graph
[Figure: undirected graph on eight actions; all losses hidden]

[Figure: the same graph after the player picks an action; its loss (3) is revealed, the other seven stay hidden]

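[Figure: the losses of the picked action's neighbors are revealed as well (5, 3, 1, 7); the remaining four stay hidden]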

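MAB: no edges
Experts: clique
[Figure: with no edges only the picked action's loss (3) is revealed; with a clique all eight losses are revealed (5, 3, 6, 1, 4, 7, 8, 2)]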
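
The two extremes above can be sketched with a plain adjacency map. This is a minimal illustration; the function names and the dict-of-sets representation are assumptions made for the example, not part of the talk.

```python
def observed_set(graph, action):
    """Actions whose losses are revealed when `action` is played.

    `graph` maps each action to the set of its neighbors. The played
    action always sees its own loss (an implicit self-loop).
    """
    return {action} | set(graph.get(action, ()))


def bandit_graph(k):
    # MAB: no edges, so playing an action reveals only that action's loss.
    return {a: set() for a in range(k)}


def experts_graph(k):
    # Experts: clique, so playing any action reveals every loss.
    return {a: set(range(k)) - {a} for a in range(k)}
```

Any graph between these two extremes gives a feedback model between bandit and full information.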

Modeling
Directed vs. undirected
– different types of dependencies
Different measures
– independent set
– dominating set
– maximum acyclic subgraph
Informed vs. uninformed: when does the learner observe the graph?
– before acting
– after acting, and then only the neighbors

Our Results
Uninformed setting
– undirected graph: only the neighbors of the node are observed; independent sets
– directed graph: maximum acyclic subgraph (not tight); random Erdős–Rényi graphs
Informed setting
– directed graphs
– regret characterization: dominating sets and independent sets
– both in expectation and with high probability

EXP3-SET
Online algorithm and regret theorem [equations not preserved in the transcript]
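
The algorithm's equations are missing from the transcript. The sketch below follows the usual EXP3-SET recipe as it is commonly described: exponential weights, with each observed loss divided by the probability that it was observed. The function name, the fixed `observes` map and the parameter names are assumptions for illustration, and the learning rate is left to the caller.

```python
import math
import random


def exp3_set(losses, observes, eta, rng=None):
    """Run an EXP3-SET-style learner on a fixed observation graph.

    losses[t][a] : loss in [0, 1] of action a at step t (hidden from the learner)
    observes[i]  : set of actions revealed when i is played (must contain i)
    eta          : learning rate
    Returns the list of played actions.
    """
    rng = rng or random.Random(0)
    k = len(observes)
    weights = [1.0] * k
    played = []
    for loss_t in losses:
        total = sum(weights)
        p = [w / total for w in weights]
        x = rng.choices(range(k), weights=p)[0]
        played.append(x)
        for a in observes[x]:
            # Probability that a's loss is observed under p.
            q = sum(p[i] for i in range(k) if a in observes[i])
            # Importance-weighted loss estimate, then exponential update.
            weights[a] *= math.exp(-eta * loss_t[a] / q)
    return played
```

With `observes[i] = {i}` this reduces to a bandit-style update, and with every set equal to all actions it reduces to full-information exponential weights. For long horizons the weights should be renormalized to avoid underflow.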

EXP3-SET regret – key lemma
Lemma
Note: for MAB, Q = K; for full information, Q = 1
Proof: build an independent set S
– consider the action a with minimal Pr[a observed]
– add a to S
– delete a and its neighbors
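
The proof steps above can be traced numerically. The sketch assumes that Q denotes the sum over actions of p_a / Pr[a observed], which is consistent with the note Q = K for MAB and Q = 1 for full information. The helper names are hypothetical, and the observation probabilities are recomputed on the surviving subgraph at each step.

```python
def observation_prob(p, nbrs, alive):
    # Pr[a observed] inside the surviving subgraph: own mass plus neighbors' mass.
    return {a: p[a] + sum(p[b] for b in nbrs[a] if b in alive) for a in alive}


def greedy_independent_set(p, nbrs):
    """Repeatedly take the surviving action with minimal observation
    probability, add it to S, then delete it and its neighbors."""
    alive = set(nbrs)
    chosen = []
    while alive:
        q = observation_prob(p, nbrs, alive)
        a = min(alive, key=lambda v: (q[v], v))
        chosen.append(a)
        alive -= {a} | set(nbrs[a])
    return chosen


def lemma_quantity(p, nbrs):
    # Q = sum over actions of p_a / Pr[a observed] on the full graph.
    q = observation_prob(p, nbrs, set(nbrs))
    return sum(p[a] / q[a] for a in nbrs)
```

Each greedy step accounts for at most 1 in the sum, so Q is at most |S|, and S is an independent set by construction.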

EXP3-SET – directed case
Directed graph: the lemma does not hold
Example:
– tournament graph: j → i iff j < i
– probabilities p_i = 2^{-i}
– α(G) = 1
Random graph
– Erdős–Rényi with edge parameter r
– regret
– MAB: r = 0; Experts: r = 1
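
A quick numeric check shows why the undirected lemma fails on this example. Two assumptions are made here and are not taken from the slide: action i is observed whenever some action j ≥ i is played (the orientation under which the sum blows up), and the last action absorbs the leftover mass so that the p_i sum to 1.

```python
def tournament_quantity(k):
    """Sum of p_i / Pr[i observed] on a transitive tournament with k actions.

    Assumed orientation: action i is observed whenever an action j >= i is played.
    """
    # p_i = 2^{-i} for i < k; the last action takes the leftover mass 2^{-(k-1)}.
    p = {i: 2.0 ** -i for i in range(1, k)}
    p[k] = 2.0 ** -(k - 1)
    total = 0.0
    for i in range(1, k + 1):
        q_i = sum(p[j] for j in range(i, k + 1))  # Pr[i observed]
        total += p[i] / q_i
    return total
```

Every term except the last equals 1/2, so the sum is (K + 1)/2, which grows linearly in K even though α(G) = 1.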

EXP3-SET – directed case
Upper bound (directed)
– mas(G) = maximum acyclic subgraph of G
– tournament: mas(G) = K and α(G) = 1
– regret
Lower bound (directed)
– any fixed graph G
– regret (the graph is known in advance)

Dominating set – directed graph
[Figure: directed observation graph on eight actions; losses hidden]

EXP3-DOM
Simplified version
– fixed graph G
– D is a dominating set (log approximation)
Main modification
– add probabilities to D to induce observability
Probabilities:
Select X_t using p_t
Observe l_t(a) for a in S_{X_t,t}
Weights:
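
The "main modification" can be illustrated as a mixture: most of the mass follows the exponential weights, and a γ share is spread uniformly over the dominating set D. The exact formula is not in the transcript, so this is a sketch of the usual form, with hypothetical names.

```python
def mix_with_dominating_set(weights, dom_set, gamma):
    """Return p with p_a = (1 - gamma) * w_a / W, plus gamma / |D| if a is in D.

    weights : dict mapping action -> positive weight
    dom_set : set of actions forming a dominating set D
    gamma   : exploration rate in (0, 1)
    """
    total = sum(weights.values())
    bonus = gamma / len(dom_set)
    return {
        a: (1.0 - gamma) * w / total + (bonus if a in dom_set else 0.0)
        for a, w in weights.items()
    }
```

Since every action is observed by some member of D, every action is observed with probability at least γ/|D|, which keeps the importance-weighted loss estimates bounded.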

EXP3-DOM – simple example
Transitive observability
– tournament: action 1 observes all actions
– D = {1}
EXP3-DOM
– sample action 1 with probability γ (action 1 is the exploration)
– otherwise run a MAB algorithm, specifically EXP3-SET
Intuition
– action 1 replaces the mixture with the uniform distribution

Conclusion
Observability model
– between MAB and Experts
– more work to be done
Uninformed setting
– undirected graph
Informed setting
– directed graph
[Kocák, Neu, Valko and Munos]: improved results for the uninformed setting

EXP3-DOM – main theorem
Theorem
Tuning γ
Corollary

Outline
Model and motivation
Symmetric observability
Non-symmetric observability

EXP3-DOM: key lemma
Lemma
– G directed graph
– d_i^- is the indegree of i
– α = α(G)
Turán's theorem
– undirected graph G(V,E)
Proof (high level)
– shrink the graph: G_K, G_{K-1}, …
– at step s, delete the node of maximum indegree
– apply Turán's theorem
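
The statement of Turán's theorem is missing from the transcript. In its standard form, an undirected graph with n nodes and m edges contains an independent set of size at least n / (2m/n + 1). The sketch below checks this with a minimum-degree greedy procedure, a swapped-in technique that is known to reach the bound. It is not the max-indegree deletion used in the slide's proof, and the names are hypothetical.

```python
def turan_bound(n, m):
    # Turán-type guarantee: alpha(G) >= n / (2m/n + 1).
    return n / (2.0 * m / n + 1.0)


def min_degree_independent_set(nbrs):
    """Greedy independent set on an undirected graph given as {node: set(neighbors)}."""
    alive = {v: set(ns) for v, ns in nbrs.items()}
    chosen = []
    while alive:
        # Take a surviving node of minimum degree (ties broken by node id).
        v = min(alive, key=lambda u: (len(alive[u]), u))
        chosen.append(v)
        removed = {v} | alive[v]
        for u in removed:
            alive.pop(u, None)
        for ns in alive.values():
            ns -= removed  # drop edges to deleted nodes
    return chosen
```

On a 6-cycle (n = 6, m = 6) the bound is 2, and the greedy procedure returns three pairwise non-adjacent nodes.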

EXP3-DOM: key lemma (proof)
Completing the proof
Note: due to edge elimination

EXP3-DOM – key lemma (modified)
Lemma (what we really need!)
– G(V,E) directed graph
– IN_i: indegree of i
– r: size of the dominating set; α: size of the independent set
– p: distribution over V with p_i ≥ β

EXP3-DOM: changing graphs
Simple case
– all dominating sets have the same size
– or approximately the same size
Problem
– dominating sets of different sizes: the size can be 1 or K
Solution
– keep log levels, indexed by ⌊log_2 |D_t|⌋
– one algorithm per level
Complications
– parameters depend on the level
– setting the learning rate needs a delicate doubling
Main technical challenge
– handling a dynamic adversary

EXP3-DOM
Receive the observation graph
– find a dominating set D_t (logarithmic approximation)
Run the right copy
– let b_t = ⌊log_2 |D_t|⌋
– run copy b_t (log copies in total)
For copy b_t
– parameters depend on b_t
Probabilities:
Select X_t using p
Observe l_t(a) for a in S_{X_t,t}
Weights:
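
The first two steps can be sketched as follows. The greedy set-cover heuristic is one standard way to get a logarithmic approximation of the smallest dominating set; the talk does not say which method it uses. The level index assumes the garbled brackets around log_2 are a floor. Function names are hypothetical, and actions are assumed to be integers.

```python
def greedy_dominating_set(out_nbrs):
    """out_nbrs[i] = actions whose losses are revealed by playing i.

    Action i always reveals itself. Returns a list of actions that
    together reveal every action.
    """
    uncovered = set(out_nbrs)
    dom = []
    while uncovered:
        # Pick the action revealing the most still-uncovered actions
        # (ties broken toward the smallest action id).
        best = max(
            out_nbrs,
            key=lambda i: (len(({i} | set(out_nbrs[i])) & uncovered), -i),
        )
        dom.append(best)
        uncovered -= {best} | set(out_nbrs[best])
    return dom


def level(dom):
    # b_t = floor(log2 |D_t|), computed exactly on integers.
    return len(dom).bit_length() - 1
```

In the transitive tournament where one action reveals all the others, the dominating set is that single action and the level is 0. With no edges, all K actions are needed and the level is ⌊log_2 K⌋.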

EXP3-DOM – main theorem
Theorem
Tuning γ_b

Independent set
Independent set α(G) [Mannor & Shamir 2012]
Tight regret
– α(G) "replaces" K
Cons:
– requires observing G
– solves an LP at each step
[Figure: observation graph on eight actions; losses hidden]
