Using Value of Information to Learn and Classify under Hard Budgets
Russell Greiner, Daniel Lizotte, Aloak Kapoor, Omid Madani
Dept of Computing Science, University of Alberta / Yahoo! Research

Task:
– Need a classifier for diagnosing cancer subtypes

Given:
– a pool of patients whose subtype is known, but whose feature values are NOT known
– the cost c(X_i) of purchasing feature X_i
– a fixed budget for purchasing feature values

Produce:
– a classifier that predicts the subtype of a novel instance from the values of its features
– … learned using only the feature values purchased

Process:
– Initially, the learner R knows NO feature values
– At each step, R can purchase the value of one feature of one instance, at its cost
– … basing each choice on the results of prior purchases, until the fixed budget is exhausted
– Then R produces a classifier

Challenge: at each step, what should R purchase?
– Which feature of which instance?
– Each purchase alters the accuracy of the eventual classifier and decreases the remaining budget

Quality:
– accuracy of the classifier obtained
– REGRET: the difference between this classifier and the optimal one

Simpler Task:
Determine which coin has the highest P(head), based on the results of only 20 flips.
[Figures: coins C_1 … C_7 with running head/tail tallies after each flip, ending with the selector's answer "C_7"; and the original task, shown as a table of instances (features X_1 … X_4, label Y, mostly unknown "?" entries), per-feature costs, a remaining budget, and a Selector deciding which feature of which instance the Learner should purchase.]

Original Task, Bayesian Framework:
– Coin C_i is drawn from Beta(a_i, b_i)
– MDP:
  – State = (a_1, b_1, …, a_k, b_k, r), the posterior parameters plus the remaining budget r
  – Action = "flip coin i"
  – Reward = 0 if r > 0; otherwise max_i { a_i / (a_i + b_i) }
– Solving for the optimal purchasing policy is NP-hard
– ⇒ develop tractable heuristic policies that perform well

Heuristic Policies (see the code sketch at the end of this part, after the NaïveBayes note):
– Round Robin: flip C_1, then C_2, then …
– Biased Robin: flip C_i; if heads, flip C_i again, else move on to C_{i+1}
– Greedy Loss Reduction: Loss1(C_i) = expected loss after flipping C_i once; flip C* = argmin_i { Loss1(C_i) }, once
– Single Feature Lookahead (k): SFL(C_i, k) = expected loss after spending k flips on C_i; flip C* = argmin_i { SFL(C_i, k) }, once

[Figure: the purchase table being filled in step by step (values "+" / "-" appearing) as the budget drops from $100 to $85.]

A is an APPROXIMATION algorithm iff A's regret is bounded by a constant worse than optimal (for any budget, number of coins, …).
[Figure: regret vs. budget for the optimal algorithm and an algorithm A, marking A's regret r_A.]
NOT approximation algorithms: Round Robin, Random, Greedy, Interval Estimation.

Results (UAI'03; UAI'04; COLT'04; ECML'05):
[Figure: performance curves for Beta(1,1), n=10, b=10; Beta(1,1), n=10, b=40; Beta(10,1), n=10, b=40.]
– The obvious approach, round robin, is NOT good!
– Contingent policies work best
– It is important to know and use the remaining budget

Related Work:
– Not a standard bandit problem: pure exploration for b steps, then a single exploitation
– Not on-line learning: no "feedback" until the end
– Not PAC-learning: a fixed number of instances, NOT "polynomial"
– Not standard experimental design
– The coin task is simple active learning; general budgeted learning is different

Use a NaïveBayes classifier: it handles missing data and assumes no feature interactions. Each +class instance is "the same", so there are only O(N) parameters to estimate.
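To make Biased Robin and SFL(k) concrete, here is a minimal sketch of the simpler coins task; it is not the authors' code. It assumes Beta priors for each coin and reads the loss in Loss1/SFL as one minus the expected best posterior mean after the candidate flips, which is one plausible interpretation of the definitions above; the function names and example coin biases are illustrative.

    # Hypothetical sketch of two heuristic policies for the
    # "which coin has the highest P(head)?" task, with Beta(a_i, b_i)
    # posteriors and a hard budget of flips.
    import math
    import random

    def beta_binomial_pmf(h, k, a, b):
        """P(h heads in k flips of a coin with a Beta(a, b) posterior)."""
        log_beta = lambda x, y: math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
        return math.comb(k, h) * math.exp(log_beta(a + h, b + k - h) - log_beta(a, b))

    def sfl_loss(i, k, A, B):
        """SFL(C_i, k): expected loss of devoting the next k flips to coin i
        (assumed here to be 1 - E[best posterior mean afterwards])."""
        other_best = max(A[j] / (A[j] + B[j]) for j in range(len(A)) if j != i)
        exp_best = 0.0
        for h in range(k + 1):                       # enumerate possible head counts
            mean_i = (A[i] + h) / (A[i] + B[i] + k)  # coin i's updated posterior mean
            exp_best += beta_binomial_pmf(h, k, A[i], B[i]) * max(mean_i, other_best)
        return 1.0 - exp_best

    def run(policy, true_p, budget, k=5, a0=1.0, b0=1.0, rng=random):
        """Spend the flip budget with the given policy, then report the chosen coin."""
        n = len(true_p)
        A, B = [a0] * n, [b0] * n                    # Beta posterior parameters
        current = 0                                  # used by Biased Robin
        for _ in range(budget):
            if policy == "biased_robin":
                i = current
            elif policy == "sfl":
                i = min(range(n), key=lambda j: sfl_loss(j, k, A, B))
            else:
                raise ValueError(policy)
            heads = rng.random() < true_p[i]         # flip the chosen coin once
            A[i] += heads
            B[i] += not heads
            if policy == "biased_robin":             # stay on a winner, move on after a tail
                current = i if heads else (i + 1) % n
        return max(range(n), key=lambda j: A[j] / (A[j] + B[j]))

    if __name__ == "__main__":
        coins = [0.3, 0.5, 0.7, 0.4]                 # illustrative true biases
        print("Biased Robin picks coin", run("biased_robin", coins, budget=20))
        print("SFL(5) picks coin      ", run("sfl", coins, budget=20))

Round Robin and Greedy Loss Reduction drop out as special cases: Round Robin ignores the posteriors entirely, and Greedy Loss Reduction is SFL with k = 1.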
Extended Task:
So far, the LEARNER (the researcher) has to pay for features… but the CLASSIFIER (the "MD") gets ALL feature values for free!
Typically the MD also has constraints (e.g., capitation).

Extended model: a hard budget for BOTH the learner and the classifier.
– E.g., spend b_L = $10,000 to learn a classifier that may spend only b_C = $50 per patient.

Classifier = "Active Classifier" [GGR, 2002]:
– a policy (decision tree) that sequentially gathers information about the instance until rendering a decision
– must reach a decision by depth b_C

Learner:
– spends b_L gathering information, yielding a posterior distribution P(·) [using the naïve Bayes assumption]
– uses a dynamic program to find the best cost-b_C policy for P(·)

Double dynamic program! Too slow ⇒ use heuristic policies.

Issues:
– Use "analogous" heuristic policies
– Round-robin (the standard approach) is still bad
– Single Feature Lookahead: how far ahead, i.e., which k in SFL(k)?

[Figures: results on Glass with identical feature costs (b_C = 3) and on Heart Disease with different feature costs (b_C = 7).]

Randomized SFL:
– flip C_i with probability exp(SFL(C_i, k)), once (a softmax over the SFL scores; see the sketch below)

Issues:
– Round-robin is still bad… very bad…
– Randomized SFL is best
– (Deterministic) SFL is "too focused"
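As a companion to the earlier sketch, here is a minimal illustration of the Randomized SFL selection step; it is not the authors' code. The poster writes the selection probability as exp(SFL(C_i, k)); since SFL is a loss, this sketch assumes the intended rule puts more mass on candidates with lower expected loss, p_i proportional to exp(-SFL(C_i, k) / tau). The temperature tau and the normalisation are assumptions.

    # Hypothetical sketch of Randomized SFL selection.
    # Assumption: SFL scores are losses, so the softmax favours lower scores;
    # p_i is proportional to exp(-SFL(C_i, k) / tau), with tau illustrative.
    import math
    import random

    def randomized_sfl_choice(sfl_scores, tau=0.05, rng=random):
        """Pick one purchase index, with probability decreasing in its SFL loss."""
        lo = min(sfl_scores)                           # shift for numerical stability
        weights = [math.exp(-(s - lo) / tau) for s in sfl_scores]
        total = sum(weights)
        r = rng.random() * total
        for i, w in enumerate(weights):                # sample from the softmax
            r -= w
            if r <= 0:
                return i
        return len(weights) - 1

    if __name__ == "__main__":
        # Example: SFL(k) losses for four candidate (feature, instance) purchases.
        scores = [0.41, 0.39, 0.40, 0.45]
        picks = [randomized_sfl_choice(scores) for _ in range(10_000)]
        for i in range(len(scores)):
            print(f"purchase {i}: chosen {picks.count(i) / len(picks):.1%} of the time")

Randomizing the choice is what keeps the policy from being "too focused": deterministic SFL always takes the single argmin purchase, whereas the softmax spreads purchases across near-tied candidates.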