Online Learning Yiling Chen
Machine Learning: Use past observations to automatically learn to make better predictions or decisions in the future. This is a large field; we are only scratching the surface of part of it.
Example: Click Prediction
Example: Recommender System Netflix challenge
Spam Prediction. Features: Unknown Sender; Sent to more than 10 people; “Cheap” or “Sale”; “Dear Sir”; … Label: Spam? Need some reasonable concept classes. Disjunctions: Spam if “Dear Sir” or “Sent to more than 10 people”. Threshold: Spam if “Dear Sir” + “Sent to more than 10 people” + “Unknown sender” > 2.
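The two concept classes can be made concrete with a small sketch. It assumes binary features (1 = present, 0 = absent) and reads "disjunction" as a logical OR over the listed features; the dictionary keys such as `dear_sir` are hypothetical names chosen for illustration, not part of the slides.

```python
from typing import Dict

# Hypothetical feature names mirroring the columns on the slide (1 = present, 0 = absent).
Email = Dict[str, int]


def disjunction_rule(email: Email) -> int:
    """Spam if "Dear Sir" OR "Sent to more than 10 people" is present."""
    return int(email["dear_sir"] == 1 or email["many_recipients"] == 1)


def threshold_rule(email: Email) -> int:
    """Spam if dear_sir + many_recipients + unknown_sender exceeds 2."""
    total = email["dear_sir"] + email["many_recipients"] + email["unknown_sender"]
    return int(total > 2)


email = {"unknown_sender": 0, "many_recipients": 1, "cheap_or_sale": 0, "dear_sir": 1}
print(disjunction_rule(email))  # 1: at least one listed feature is present
print(threshold_rule(email))    # 0: only two of the three features are present
```

With three binary features and a threshold of 2, the threshold rule fires only when all three features are present, so it is stricter than the disjunction on this example.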
Batch Learning. Training examples (features: Unknown Sender; Sent to more than 10 people; “Cheap” or “Sale”; “Dear Sir”; … label: Spam?) -> Learning Algorithm -> Prediction Rule for New Examples.
Online Learning. Examples arrive one at a time (features: Unknown Sender; Sent to more than 10 people; “Cheap” or “Sale”; “Dear Sir”; … label: Spam?). Example rows: 0 0 0 0 ? and 1 1 1 1 ?, with revealed labels 0 and 1. How to update the prediction rule?
Competitive Ratio. Optimal offline algorithm: optimal in hindsight. Competitive ratio = performance of the online algorithm / performance of the optimal offline algorithm.
Why Do We Care? The “Learning from Expert Advice” setting is an information aggregation problem. Example experts: (1) Spam if “Dear Sir” and “Sent to more than 10 people”; (2) Spam if “Dear Sir” + “Sent to more than 10 people” + “Unknown sender” > 2; (3) Yahoo!’s spam filter. Can we make use of the predictions of these “experts”?
Basic Online Learning Setting. In each round: (1) the learning algorithm sees a new example; (2) the algorithm predicts a label for this example; (3) after the prediction, the true label is observed; (4) the algorithm makes a mistake if the predicted label differs from the true label; (5) the algorithm updates the prediction rule.
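The round structure above can be sketched as a generic loop. This is a minimal illustration under stated assumptions: `run_online` and `MajorityLabelLearner` are hypothetical names, and the toy learner (predict the label seen most often so far) is only a stand-in for a real prediction rule.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[int, ...]


def run_online(
    predict: Callable[[Example], int],
    update: Callable[[Example, int], None],
    stream: Iterable[Tuple[Example, int]],
) -> int:
    """Run the see / predict / observe / update protocol and count mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = predict(x)   # the algorithm commits to a prediction first
        if y_hat != y:       # the true label is revealed only afterwards
            mistakes += 1
        update(x, y)         # the learner may now revise its rule
    return mistakes


class MajorityLabelLearner:
    """Toy learner (illustrative only): predict the label seen most often so far."""

    def __init__(self) -> None:
        self.counts: List[int] = [0, 0]

    def predict(self, x: Example) -> int:
        return int(self.counts[1] > self.counts[0])

    def update(self, x: Example, y: int) -> None:
        self.counts[y] += 1


stream = [((0, 0, 0, 0), 0), ((1, 1, 1, 1), 1), ((1, 0, 1, 1), 1), ((1, 1, 0, 1), 1)]
learner = MajorityLabelLearner()
mistakes = run_online(learner.predict, learner.update, stream)
print(mistakes)  # 2
```

Keeping `predict` and `update` as separate callables mirrors the protocol: the learner never sees the true label before it has answered.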
Two Goals. (1) Minimize the number of mistakes: hope that (# of mistakes / # of rounds) -> 0; assume that there is a perfect target function. (2) Minimize regret: hope that (# of mistakes - # of mistakes by comparator) / # of rounds -> 0; adversarial setting.
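As a small worked example of the second goal, the sketch below computes the per-round regret of an algorithm against a single fixed comparator. The prediction sequences are made up for illustration, and `average_regret` is a hypothetical helper name.

```python
from typing import Sequence


def count_mistakes(predictions: Sequence[int], labels: Sequence[int]) -> int:
    """Number of rounds in which the prediction differs from the true label."""
    return sum(int(p != y) for p, y in zip(predictions, labels))


def average_regret(
    algorithm_preds: Sequence[int],
    comparator_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """(# of mistakes - # of mistakes by comparator) / # of rounds."""
    rounds = len(labels)
    gap = count_mistakes(algorithm_preds, labels) - count_mistakes(comparator_preds, labels)
    return gap / rounds


# Made-up data: 8 rounds, the algorithm errs 3 times, the comparator once.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
algorithm_preds = [0, 0, 1, 0, 0, 1, 1, 0]
comparator_preds = [1, 0, 1, 1, 0, 0, 0, 0]

print(count_mistakes(algorithm_preds, labels) / len(labels))      # 0.375
print(average_regret(algorithm_preds, comparator_preds, labels))  # 0.25
```

The first goal asks that the mistake rate itself vanish as the number of rounds grows; the second only asks that the gap to the comparator vanish, which is meaningful even when no perfect rule exists.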
Minimizing the Number of Mistakes
Halving Algorithm. Let C be a finite concept class. Assume that there exists c in C such that c(x_t) = y_t for every round t, where x_t is the example and y_t is its true label. Then the number of mistakes made by Halving is no more than log_2 |C|.
Halving Algorithm. The current version space contains all functions that are consistent with the observations so far. At each round t, predict the label chosen by the majority of functions in the current version space. Then update the version space by removing every function that disagrees with the revealed true label.
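The majority-vote and elimination steps can be sketched as follows. This is a minimal illustration, assuming a concept class small enough to enumerate (all monotone disjunctions over 4 binary variables) and a target that belongs to the class; the function names are hypothetical, and ties are broken toward label 1 as an arbitrary choice.

```python
import math
from itertools import combinations, product
from typing import Callable, Iterable, List, Sequence, Tuple

Concept = Callable[[Sequence[int]], int]


def make_disjunction(indices: Tuple[int, ...]) -> Concept:
    """Monotone disjunction: label 1 if any variable in `indices` is 1."""
    return lambda x: int(any(x[i] for i in indices))


def run_halving(concepts: List[Concept], stream: Iterable[Tuple[Sequence[int], int]]) -> int:
    """Predict with the majority of the version space, then drop inconsistent concepts."""
    version_space = list(concepts)
    mistakes = 0
    for x, y in stream:
        votes_for_one = sum(c(x) for c in version_space)
        y_hat = int(2 * votes_for_one >= len(version_space))  # ties go to label 1
        if y_hat != y:
            mistakes += 1
        # Keep only the concepts that agree with the revealed label.
        version_space = [c for c in version_space if c(x) == y]
    return mistakes


n_vars = 4
concepts = [
    make_disjunction(subset)
    for size in range(1, n_vars + 1)
    for subset in combinations(range(n_vars), size)
]
target = make_disjunction((0, 2))  # the assumed perfect concept, a member of the class
stream = [(x, target(x)) for x in product((0, 1), repeat=n_vars)]

mistakes = run_halving(concepts, stream)
print(mistakes, "<=", math.log2(len(concepts)))
```

Every mistake removes at least half of the version space, which is where the logarithmic bound comes from. The explicit list of concepts is also what makes the approach impractical when the class is large.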
Monotonic Disjunctions. The concept class can be the disjunctions of r of the n variables. |C| can be large, so Halving is not computationally tractable.
The Winnow Algorithm
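# of mistakes <= O(r log n), where n is the number of variables and r is the number of variables in the target disjunction. We can treat each variable (feature) as an expert. Winnow updates the weights of the experts dynamically.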
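Since the update rule itself is not spelled out in the text, the sketch below uses one common textbook variant of Winnow as an assumption: all weights start at 1, the threshold is n, active weights are doubled after a missed positive and set to zero after a false positive. The function name `winnow` and the toy target are illustrative only.

```python
from itertools import product
from typing import Iterable, List, Sequence, Tuple


def winnow(stream: Iterable[Tuple[Sequence[int], int]], n: int) -> Tuple[int, List[float]]:
    """Textbook Winnow variant (assumed): threshold n, double on misses, zero on false alarms."""
    weights = [1.0] * n
    mistakes = 0
    for x, y in stream:
        score = sum(w * xi for w, xi in zip(weights, x))
        y_hat = int(score >= n)
        if y_hat == y:
            continue
        mistakes += 1
        if y == 1:
            # Missed a positive: promote every active feature.
            weights = [w * 2 if xi else w for w, xi in zip(weights, x)]
        else:
            # False positive: eliminate every active feature.
            weights = [0.0 if xi else w for w, xi in zip(weights, x)]
    return mistakes, weights


n = 8
relevant = (0, 3)  # hypothetical target: x0 OR x3, so r = 2
stream = [(x, int(any(x[i] for i in relevant))) for x in product((0, 1), repeat=n)]

mistakes, weights = winnow(stream, n)
print(mistakes, weights)
```

Each feature acts as an expert with its own weight, so the algorithm stores only n numbers instead of an explicit version space, and its mistake count grows only logarithmically with n.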
Minimizing Regret. No assumption on the distribution of examples. No assumption on the target function. Adversarial setting.
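# of mistakes <= 2.41 (m + log n), where m is the number of mistakes made by the best expert, n is the number of experts, and log is taken base 2.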
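A bound of this form is the standard guarantee for the deterministic Weighted Majority algorithm with a halving penalty. The text does not name the algorithm, so treating it as Weighted Majority is an assumption; the sketch below and its toy experts are illustrative only.

```python
import math
from typing import List, Sequence


def weighted_majority(
    expert_predictions: Sequence[Sequence[int]],
    labels: Sequence[int],
    beta: float = 0.5,
) -> int:
    """Deterministic Weighted Majority (assumed): vote by weight, shrink wrong experts by beta."""
    n = len(expert_predictions[0])
    weights: List[float] = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_predictions, labels):
        weight_for_one = sum(w for w, p in zip(weights, preds) if p == 1)
        y_hat = int(2 * weight_for_one >= sum(weights))  # ties go to label 1
        if y_hat != y:
            mistakes += 1
        # Penalize every expert that was wrong in this round.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes


# Toy data: 3 experts (always 1, always 0, alternating) over 12 rounds.
labels = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
expert_predictions = [[1, 0, t % 2] for t in range(len(labels))]

n = len(expert_predictions[0])
m = min(
    sum(int(preds[i] != y) for preds, y in zip(expert_predictions, labels))
    for i in range(n)
)
mistakes = weighted_majority(expert_predictions, labels)
print(mistakes, "<=", 2.41 * (m + math.log2(n)))
```

Each mistake by the algorithm removes at least a quarter of the total weight, while the best expert keeps weight at least (1/2)^m; comparing the two gives the constant 2.41.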
Expected # of mistakes <= m + log n + O(sqrt(m log n)), where m is the number of mistakes made by the best expert and n is the number of experts.
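A guarantee with leading constant 1 on m is typically obtained by randomizing the prediction. The sketch below assumes the standard Randomized Weighted Majority scheme, which the text does not name: follow one expert sampled in proportion to its weight and multiply the weight of each wrong expert by (1 - eps). The function name, the value of `eps`, and the toy experts are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def randomized_weighted_majority(
    expert_predictions: Sequence[Sequence[int]],
    labels: Sequence[int],
    eps: float = 0.25,
    seed: int = 0,
) -> Tuple[int, float]:
    """Return (realized mistakes, expected mistakes) of the assumed randomized scheme."""
    rng = random.Random(seed)
    n = len(expert_predictions[0])
    weights: List[float] = [1.0] * n
    realized = 0
    expected = 0.0
    for preds, y in zip(expert_predictions, labels):
        total = sum(weights)
        wrong_weight = sum(w for w, p in zip(weights, preds) if p != y)
        expected += wrong_weight / total  # probability of erring in this round
        chosen = rng.choices(range(n), weights=weights, k=1)[0]
        if preds[chosen] != y:
            realized += 1
        # Shrink every expert that was wrong in this round.
        weights = [w * (1 - eps) if p != y else w for w, p in zip(weights, preds)]
    return realized, expected


# Toy data: 3 experts (always 1, always 0, alternating) over 12 rounds.
labels = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
expert_predictions = [[1, 0, t % 2] for t in range(len(labels))]

n = len(expert_predictions[0])
m = min(
    sum(int(preds[i] != y) for preds, y in zip(expert_predictions, labels))
    for i in range(n)
)
eps = 0.25
realized, expected = randomized_weighted_majority(expert_predictions, labels, eps=eps)
print(realized, expected, "<=", (1 + eps) * m + math.log(n) / eps)
```

For eps at most 1/2 the expected number of mistakes is at most (1 + eps) m + ln(n) / eps, and tuning eps to roughly sqrt(ln(n) / m) yields a bound of the form m + O(sqrt(m log n)).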