Presenting work by various authors, and my own work in collaboration with colleagues at Microsoft and the University of Amsterdam.
Example task: find the best news articles based on user context; optimize click-through rate.
Example task: tune ad display parameters (e.g., mainline reserve) to optimize revenue.
Example task: improve the ranking of query auto-completion (QAC) suggestions to optimize suggestion usage.
Typical approach: lots of offline tuning + A/B testing.
Example: which search interface results in higher revenue? [Kohavi et al. '09, '12]
Image adapted from: https://www.flickr.com/photos/prayitnophotography/4464000634
Address a key challenge: how to balance exploration and exploitation – explore to learn, exploit to benefit from what has been learned. This is a reinforcement learning problem in which actions do not affect future states (i.e., a bandit problem).
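To make the trade-off concrete, here is a minimal sketch of an epsilon-greedy bandit policy – a simple baseline, assumed here for illustration rather than taken from the talk; the click-through rates in the simulation are hypothetical.

```python
import random

def epsilon_greedy(means, epsilon=0.1):
    """Explore a uniformly random arm with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

# Hypothetical simulation: true click probabilities are unknown to the learner.
true_ctr = [0.05, 0.08, 0.06]
counts = [0] * len(true_ctr)
means = [0.0] * len(true_ctr)
for t in range(10000):
    arm = epsilon_greedy(means)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0  # simulated click
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]         # incremental mean
```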
Example: both arms are promising, but there is higher uncertainty for arm C. Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.
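A standard way to act on both expected payoff and uncertainty is an upper-confidence-bound rule; the sketch below is the textbook UCB1 index, shown as an illustration of the principle rather than the specific method from the talk.

```python
import math

def ucb1_select(counts, means, t):
    """Pick the arm maximizing estimated mean plus a confidence width (t >= 1)."""
    for arm in range(len(counts)):
        if counts[arm] == 0:
            return arm  # play every arm once before trusting the index
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
```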
Contextual bandits [Li et al. '12]: the learner repeatedly observes a context (e.g., user features), selects an action (e.g., an article to display), and observes a reward (e.g., a click). Example results: balancing exploration and exploitation is crucial for good results.
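As a concrete illustration of a contextual bandit algorithm, here is a minimal LinUCB-style sketch with disjoint linear models per action – a common formulation in this line of work; the dimensions and the exploration parameter alpha below are assumptions for illustration.

```python
import numpy as np

class LinUCB:
    """Disjoint linear UCB: one ridge-regression model per action."""
    def __init__(self, n_actions, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_actions)]    # accumulates x x^T
        self.b = [np.zeros(dim) for _ in range(n_actions)]  # accumulates r x

    def select(self, x):
        """Choose the action with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                               # per-action weights
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, action, x, reward):
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```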
1) Balance exploration and exploitation, to ensure continued learning while applying what has been learned.
2) Explore in a small action space, but learn in a large contextual space.
Illustrated Sutra of Cause and Effect "E innga kyo" by Unknown - Woodblock reproduction, published in 1941 by Sinbi-Shoin Co., Tokyo. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:E_innga_kyo.jpg#mediaviewer/File:E_innga_kyo.jpg
Problem: estimate the effects of mainline reserve changes. [Bottou et al. '13]
Controlled experiments vs. counterfactual reasoning.
Key idea: estimate what would have happened if a different system (a different distribution over parameter values) had been used, using importance sampling [Bottou et al. '13; Precup et al. '00]. Step 1: factorize the joint distribution of the system's decisions based on the known causal graph. Step 2: compute estimates using importance sampling, reweighting each logged outcome by the ratio of the candidate distribution to the logging distribution: $Y^* \approx \frac{1}{n}\sum_{i=1}^{n} r_i \, \frac{P^*(\omega_i)}{P(\omega_i)}$.
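A minimal sketch of the importance-sampling step: reweight each logged reward by the ratio of the probability the candidate system assigns to the logged decision to the probability the logging system assigned to it. The numbers in the usage example are hypothetical.

```python
import numpy as np

def counterfactual_estimate(rewards, logged_probs, candidate_probs):
    """Estimate the candidate system's expected reward from logged data.

    rewards[i]         : observed reward for logged decision i
    logged_probs[i]    : probability of decision i under the logging system
    candidate_probs[i] : probability of decision i under the candidate system
    """
    weights = np.asarray(candidate_probs) / np.asarray(logged_probs)
    return float(np.mean(np.asarray(rewards) * weights))

# Hypothetical usage: four logged decisions, evaluated under a shifted system.
print(counterfactual_estimate(rewards=[1.0, 0.0, 1.0, 1.0],
                              logged_probs=[0.5, 0.5, 0.25, 0.25],
                              candidate_probs=[0.7, 0.3, 0.4, 0.4]))
```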
Counterfactual reasoning allows analysis over a continuous range of parameter values [Bottou et al. '13].
1) Leverage known causal structure and importance sampling to reason about "alternative realities".
2) Bound estimator error to distinguish between uncertainty due to low sample size and uncertainty due to limited exploration coverage (see the sketch below).
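One common way to expose both sources of error is to clip the importance weights and track how much weight mass the clipping discards; the sketch below illustrates this idea generically and is not the exact bound construction of Bottou et al. (the threshold R is an assumed parameter).

```python
import numpy as np

def clipped_estimate(rewards, logged_probs, candidate_probs, R=10.0):
    """Clipped importance-sampling estimate plus a coverage diagnostic.

    Clipping weights at R bounds the variance of the estimate; the average
    clipped-away mass signals regions the logging system under-explored.
    """
    w = np.asarray(candidate_probs) / np.asarray(logged_probs)
    w_clipped = np.minimum(w, R)
    estimate = float(np.mean(np.asarray(rewards) * w_clipped))
    lost_mass = float(np.mean(w - w_clipped))  # > 0 where coverage is poor
    return estimate, lost_mass
```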
Compare two rankings (example: optimize QAC ranking):
1) Generate an interleaved (combined) ranking from the two original rankings.
2) Observe user clicks on the interleaved ranking.
3) Credit clicks to the original rankers to infer the outcome of the comparison.
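A minimal sketch of one standard interleaving scheme, team-draft interleaving (the specific interleaving method is an assumption here; the talk does not name one).

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Merge two rankings into one, recording which ranker supplied each doc."""
    interleaved, teams = [], []
    while True:
        remaining_a = [d for d in ranking_a if d not in interleaved]
        remaining_b = [d for d in ranking_b if d not in interleaved]
        if not remaining_a and not remaining_b:
            break
        # A coin flip decides which ranker picks first in each round.
        for team in random.sample(("A", "B"), 2):
            remaining = [d for d in (remaining_a if team == "A" else remaining_b)
                         if d not in interleaved]
            if remaining:
                interleaved.append(remaining[0])
                teams.append(team)
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Credit each click to the ranker whose document was clicked."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins  # the ranker with more credited clicks wins the comparison
```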
Learning approach: dueling bandit gradient descent (DBGD) [Yue & Joachims '09] optimizes a weight vector for weighted-linear combinations of ranking features. Starting from the current best weight vector, a candidate ranker is generated by sampling a random direction on the unit sphere; relative listwise feedback comparing the current best and the candidate is obtained using interleaving.
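A minimal sketch of one DBGD iteration, assuming a hypothetical helper interleave_and_compare(w_best, w_candidate) that runs an interleaved comparison on live traffic and returns True if the candidate wins; the step sizes delta and alpha are assumptions.

```python
import numpy as np

def dbgd_step(w_best, interleave_and_compare, delta=1.0, alpha=0.01):
    """One DBGD iteration: propose a perturbed ranker, move toward it if it wins."""
    u = np.random.randn(len(w_best))
    u /= np.linalg.norm(u)            # random direction on the unit sphere
    w_candidate = w_best + delta * u  # candidate ranker
    if interleave_and_compare(w_best, w_candidate):
        w_best = w_best + alpha * u   # small step toward the winning candidate
    return w_best
```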
Approach: candidate pre-selection (CPS) [Hofmann et al. '13c] – generate many candidate rankers and select the most promising one before running the interleaved comparison.
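A rough sketch of the pre-selection structure, assuming a hypothetical score_on_history(w) that estimates a candidate's quality from logged interaction data (the estimator used by Hofmann et al. '13c differs; this only illustrates generating many candidates and keeping the most promising one).

```python
import numpy as np

def preselect_candidate(w_best, score_on_history, n_candidates=10, delta=1.0):
    """Generate several perturbed rankers; keep the one scoring best on history."""
    candidates = []
    for _ in range(n_candidates):
        u = np.random.randn(len(w_best))
        u /= np.linalg.norm(u)               # random direction on the unit sphere
        candidates.append(w_best + delta * u)
    return max(candidates, key=score_on_history)
```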
Results under an informational click model [Hofmann et al. '13b, '13c]: from earlier work, learning from relative listwise feedback is robust to noise; here, adding structure further improves performance dramatically.
1) Avoid the combinatorial action space by exploring in parameter space.
2) Reduce variance using relative feedback.
3) Leverage known structure for sample-efficient learning.
Contextual bandits: a systematic approach to balancing exploration and exploitation; contextual bandits explore in a small action space but optimize in a large context space.
Counterfactual reasoning: leverages causal structure and importance sampling for "what if" analyses.
Online learning to rank: avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.
Applications: assess action and solution spaces in a given application, collect and learn from exploration data, increase experimental agility.
Try this (at home): try the open-source code samples; the Living Labs challenge allows experimentation with online learning and evaluation methods.
Challenge: http://living-labs.net/challenge/
Code: https://bitbucket.org/ilps/lerot