Optimal Nonmyopic Value of Information in Graphical Models: Efficient Algorithms and Theoretical Limits. Andreas Krause, Carlos Guestrin, Computer Science Department, Carnegie Mellon University.

Presentation transcript:

Optimal Nonmyopic Value of Information in Graphical Models: Efficient Algorithms and Theoretical Limits
Andreas Krause, Carlos Guestrin
Computer Science Department, Carnegie Mellon University

Related applications
- Medical expert systems: select among potential examinations
- Sensor scheduling: observations drain power, require storage
- Active learning, experimental design, ...

Part-of-Speech Tagging
(figure: chain model with hidden tags Y1..Y5 over the observed words X1..X5 of the sentence "Andreas is giving a talk"; posteriors over S/P/O after different expert answers, e.g. observing Y3 = P or Y2 = P vs. Y2 = O, with rewards 0.8, 0.9, 0.3)
- Classify each word as belonging to the (S)ubject, (P)redicate, or (O)bject
- Classification must respect the sentence structure
- Our probabilistic model provides a certain a priori classification accuracy. What if we could ask an expert?
- Ask the expert the k most informative questions
- Need to compute the expected reward for any selection!
- What does "most informative" mean? Which reward function should we use?

Reward functions
- Depend on probability distributions: E[R(X | O)] := Σ_o P(o) R(P(X | O = o))
- In the classification / prediction setting, rewards measure reduction of uncertainty:
  - Margin to runner-up: confidence in the most likely assignment
  - Information gain: uncertainty about the hidden variables
- In the decision-theoretic setting, the reward measures the value of information
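A minimal sketch (not from the slides) of how such an expected reward could be evaluated for a single variable, assuming we can enumerate the outcomes o of an observation and query a model for the posteriors P(X | O = o). The function names (`margin_reward`, `neg_entropy_reward`, `expected_reward`) are illustrative choices for the margin and uncertainty rewards mentioned above, not identifiers from the paper.

```python
import numpy as np

def margin_reward(p):
    """Margin to runner-up: confidence in the most likely assignment."""
    top_two = np.sort(p)[-2:]
    return float(top_two[1] - top_two[0])

def neg_entropy_reward(p):
    """Negative entropy: higher reward means less uncertainty about X."""
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

def expected_reward(p_o, p_x_given_o, reward):
    """E[R(X | O)] = sum_o P(o) * R(P(X | O = o)).

    p_o:          outcome probabilities P(o)
    p_x_given_o:  one posterior distribution P(X | O = o) per outcome
    reward:       function mapping a distribution over X to a scalar
    """
    return sum(po * reward(np.asarray(px)) for po, px in zip(p_o, p_x_given_o))

# toy example: two possible outcomes of observing O
p_o = np.array([0.6, 0.4])
p_x_given_o = [np.array([0.8, 0.15, 0.05]), np.array([0.4, 0.35, 0.25])]
print(expected_reward(p_o, p_x_given_o, margin_reward))
```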

Reward functions: Value of Information (VOI)
Medical decision making:
- Utility depends on the actual condition and the chosen action
- The actual condition is unknown! We only know P(ill | O = o)
- Utility table U(condition, action):
  |              | healthy | ill   |
  | Treatment    | -$$     | $     |
  | No treatment | 0       | -$$$  |
- EU(a | O = o) = P(ill | O = o) U(ill, a) + P(healthy | O = o) U(healthy, a)
- VOI = expected maximum expected utility
- The more we know, the more effectively we can act
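A toy sketch of the VOI computation in this medical example, with made-up numbers (the utilities, prior, outcome probabilities, and posteriors below are assumptions, chosen so the posteriors are consistent with the prior): without observing, we act on the prior and pick the single best action; with an observation, we pick the best action per outcome, and VOI is the gain in expected utility.

```python
# utility table U[condition][action]; values are illustrative
U = {"healthy": {"treat": -2.0, "no_treat": 0.0},
     "ill":     {"treat":  1.0, "no_treat": -3.0}}
actions = ["treat", "no_treat"]

def expected_utility(p_ill, action):
    return p_ill * U["ill"][action] + (1 - p_ill) * U["healthy"][action]

# prior belief before any observation
p_ill_prior = 0.3
eu_no_obs = max(expected_utility(p_ill_prior, a) for a in actions)

# a hypothetical test O with two outcomes and posteriors P(ill | O = o)
p_o = [0.4, 0.6]            # P(O = positive), P(O = negative)
p_ill_given_o = [0.6, 0.1]  # note: 0.4*0.6 + 0.6*0.1 = 0.3 = prior

# expected maximum expected utility: choose the best action per outcome
eu_with_obs = sum(po * max(expected_utility(pi, a) for a in actions)
                  for po, pi in zip(p_o, p_ill_given_o))

print("VOI =", eu_with_obs - eu_no_obs)  # positive: observing helps us act
```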

Local reward functions
- Often, we want to evaluate rewards on multiple variables
- Natural way of generalizing rewards to this setting: E[R(X | O)] := Σ_i E[R(X_i | O)]
- Useful representation for many practical problems; not fundamentally necessary in our approach
- For any particular observation, local reward functions can be efficiently evaluated using probabilistic inference!

Costs and budgets
- Each variable X can have a different cost c(X)
- Instead of only allowing k questions, we specify an integer budget B which we can spend
- Examples:
  - Medical domain: cost of examinations
  - Sensor networks: power consumption
  - Part-of-speech tagging: fee for asking the expert

The subset selection problem
- Consider myopically selecting the most informative singleton E[R({O_1})], then the most informative (greedy) improvement E[R({O_2, O_1})], ..., E[R({O_k, O_{k-1}, ..., O_1})] (greedy selection)
- This can be seen as an attempt to nonmyopically maximize E[R(O)] subject to Σ_{X ∈ O} c(X) ≤ B (the total cost of observing must stay within the budget)
- The selected subset O is specified in advance (open loop)
- Often, we can acquire information based on earlier observations. What about this closed-loop setting?
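A minimal sketch of the myopic greedy selection this slide describes, under the stated budget constraint. The `expected_reward(subset)` oracle is an assumption (e.g. built from inference in the model, as in the sketch above); `cost` and `budget` correspond to c(X) and B. The optimal algorithms presented later replace this heuristic on chains.

```python
def greedy_selection(candidates, cost, budget, expected_reward):
    """Myopic greedy subset selection under a budget.

    candidates:      iterable of observable variables
    cost:            dict mapping variable -> observation cost c(X)
    budget:          total budget B
    expected_reward: assumed oracle mapping a set of variables to E[R(. | set)]
    """
    selected, spent = set(), 0
    while True:
        best, best_gain = None, 0.0
        for x in candidates:
            if x in selected or spent + cost[x] > budget:
                continue
            # myopic gain of adding x to the current selection
            gain = expected_reward(selected | {x}) - expected_reward(selected)
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:          # no affordable observation improves the reward
            return selected
        selected.add(best)
        spent += cost[best]
```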

The conditional plan problem
(figure: the part-of-speech chain over "Andreas is giving a talk"; values: (S)ubject, (P)redicate, (O)bject)
- Assume the most informative query would be Y_2
- If we observe Y_2 = P, this outcome is consistent with our beliefs, so we can e.g. stop querying
- Now assume we observe a different outcome, Y_2 = S: this outcome is inconsistent with our beliefs, so we had better explore further by querying Y_1

The conditional plan problem
- A conditional plan selects a different subset π(s) for each outcome S = s
- Find the conditional plan π that nonmyopically maximizes the expected reward, subject to the budget
(figure: decision tree that queries Y_2 first and then, depending on the answer S/P/O, queries Y_1, Y_3, Y_4, Y_5 or stops)
- Nonmyopic planning implies that we construct the entire (exponentially large) plan in advance!
- It is not clear whether the plan is even compactly representable!

A nonmyopic analysis
- These problems intuitively seem hard
- Most previous approaches are myopic: greedily select the next best observation
- In this paper, we present the first optimal nonmyopic algorithms for a non-trivial class of graphical models, together with complexity-theoretic hardness results

Inference in graphical models
- Inference P(X_i = x | O = o) is needed to compute the local reward functions
- Efficient inference is possible for many graphical models (figure: chains, trees, polytrees)
- What about optimizing value of information?

Chain graphical models
- Filtering: only use past observations (sensor scheduling, ...)
- Smoothing: use all observations (structured classification, ...)
- Contains conditional chains: HMMs, chain CRFs
(figure: chain X_1 - X_2 - X_3 - X_4 - X_5 with the flow of information along the chain)
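A minimal sketch (standard HMM forward-backward smoothing, not code from the paper) of the kind of chain inference that makes the local rewards P(X_i | O = o) cheap to evaluate; the parameterization via `prior`, `trans`, and `emit` is an assumption made for the example.

```python
import numpy as np

def smoothed_marginals(prior, trans, emit, observations):
    """P(X_t | all observations) for a discrete chain model / HMM.

    prior:        (d,) initial distribution over hidden states
    trans:        (d, d) transition matrix, trans[i, j] = P(X_{t+1}=j | X_t=i)
    emit:         (d, m) emission matrix, emit[i, k] = P(obs=k | X_t=i)
    observations: length-n sequence of observed symbols (ints in [0, m))
    """
    n, d = len(observations), len(prior)
    alpha = np.zeros((n, d))   # forward messages (normalized each step)
    beta = np.ones((n, d))     # backward messages (normalized each step)

    alpha[0] = prior * emit[:, observations[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, observations[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (emit[:, observations[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    marg = alpha * beta        # smoothed marginals up to per-step normalization
    return marg / marg.sum(axis=1, keepdims=True)
```

Dropping the backward pass (using `alpha` alone) gives the filtering marginals, i.e. the "only use past observations" setting above.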

Key insight
(figure: chain X_1 ... X_6 with observations made at X_1, X_3 and X_6)
- The expected reward for subchain 1:6 when observing X_1, X_3 and X_6 decomposes as
  Reward(1:6) = Reward(1:3) + Reward(3:6) + const(3),
  where Reward(1:3) is the expected reward for subchain 1:3 when observing X_1 and X_3, and Reward(3:6) is the expected reward for subchain 3:6 when observing X_3 and X_6
- Reward functions decompose along the chain!

Dynamic programming
- Base case: 0 observations left. Compute the expected reward for all sub-chains without making observations
- Inductive case: k observations left. Find the optimal observation (= split) and optimally allocate the budget (depending on the observation)

Base case
(figure: table of expected rewards Reward(a:b) for all sub-chains a:b, indexed by the beginning and end of the sub-chain, e.g. Reward(1:2), Reward(1:3), ..., Reward(2:6), computed without making observations)

Inductive case
- Compute the expected reward for subchain a:b, making k observations, using the expected rewards for all subchains with at most k-1 observations
- E.g., compute the value for spending the first of three observations at X_3, with 2 observations left: the reward is the sum of the subchain rewards, computed using the base case and the inductive cases for 1 and 2 observations (figure: worked numeric example on the chain X_1 ... X_6)
- We can compute the value of any split by optimally allocating budgets, referring to the base case and earlier inductive cases
- For subset selection / filtering, speedups are possible

Inductive case (continued)
- Compute the expected reward for subchain a:b, making k observations, using the expected rewards for all subchains with at most k-1 observations
- For each candidate split point s, Reward(1:6) = Reward(1:s) + Reward(s:6) + const(s); the remaining budget must be allocated optimally between the two parts
(figure: evaluating splits at 2, 3, 4, 5 gives values of information 3.7, 3.9, 3.8, 3.3; keeping the current best, the optimal VOI for subchain 1:6 with k observations left is 3.9)
- Tracing back the maximal values allows us to recover the optimal subset or conditional plan!
- The tables represent the solution in polynomial space!
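A rough sketch of the dynamic program for subset selection on a chain with unit costs. It is heavily simplified: `reward(a, b)` is an assumed oracle giving the expected reward of subchain a:b when its endpoints are the only observed variables in it (constant terms from shared endpoints are taken as folded into the oracle), and only the optimal value is returned; tracing back the maximizing splits, as the slide describes, would recover the subset itself. The actual algorithms achieve the running times quoted on the next slide with a more careful formulation.

```python
from functools import lru_cache

def optimal_subset_value(n, budget, reward):
    """Optimal nonmyopic subset selection value on a chain X_1..X_n (unit costs).

    reward(a, b): assumed oracle, expected reward of subchain a..b when X_a and
                  X_b are observed; boundary indices 0 and n+1 denote the
                  (virtual) unobserved ends of the chain.
    """
    @lru_cache(maxsize=None)
    def value(a, b, k):
        # best reward of subchain a..b with k observations still to place inside
        if k == 0 or b - a < 2:
            return reward(a, b)
        best = reward(a, b)                # option: place no further observation
        for j in range(a + 1, b):          # next observation = split point j
            for k_left in range(k):        # allocate the remaining budget k-1
                k_right = k - 1 - k_left
                best = max(best, value(a, j, k_left) + value(j, b, k_right))
        return best

    return value(0, n + 1, budget)
```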

Results about optimal algorithms
Theorem: For chain graphical models, our algorithms compute
- the nonmyopic optimal subset in time O(d B n^2) for filtering and O(d^2 B n^3) for smoothing,
- the nonmyopic optimal conditional plan in time O(d^2 B n^2) for filtering and O(d^3 B^2 n^3) for smoothing,
where d is the maximum domain size, B the budget we can spend on observations, and n the number of random variables.

Evaluation of our algorithms
- Three real-world data sets: sensor scheduling, CpG-island detection, part-of-speech tagging
- Goals: compare the optimal algorithms with (myopic) heuristics; relate objective values to prediction accuracy

Evaluation: Temperature
- Temperature data from the sensor deployment at Intel Research Berkeley
- Task: scheduling of a single sensor
- Select the k optimal times to observe the sensor during each day
- Optimize the sum of residual entropies

Evaluation: Temperature
(figure: rewards over the 0h-24h day; baseline: uniform spacing of observations)
- Optimal algorithms significantly improve on commonly used myopic heuristics
- Conditional plans give higher rewards than subsets

Evaluation: CpG-island detection
- Annotated gene DNA sequences
- Task: predict the start and end of CpG islands
- Ask an expert to annotate k places in the sequence
- Optimize the classification margin

Evaluation: CpG-island detection
- Optimal algorithms provide better prediction accuracy
- Even small differences in objective value can lead to improved prediction results

Evaluation: Reuters data, POS tagging
- CRF trained on Reuters news archive data
- Task: ask the expert for the k most informative tags
- Maximize the classification margin

Evaluation: POS tagging
- Optimizing the classification margin leads to improved precision and recall

Can we generalize?
- Many graphical-model tasks (e.g. inference, MPE) that are efficiently solvable for chains can be generalized to polytrees
- For value of information on polytrees, however, even computing expected rewards is hard, and the optimization is a lot harder!
(figure: polytree over X_1 ... X_5)

Complexity Classes (Review)
- P: probabilistic inference in polytrees
- NP: SAT
- #P: #SAT; probabilistic inference in general graphical models
- NP^PP: E-MAJSAT; MAP assignment on general graphical models; some planning problems. Wildly more complex!

Hardness results
Theorem: Even on discrete polytrees,
- computing expected rewards is #P-complete,
- subset selection is NP^PP-complete,
- computing conditional plans is NP^PP-hard.
Proofs by reduction from #3CNF-SAT (computing rewards) and E-MAJSAT (subset selection). As we presented last week at UAI, approximation algorithms with strong guarantees are available!

Summary
- We developed efficient optimal nonmyopic algorithms for chain graphical models: subset selection and conditional plans, for both filtering and smoothing
- Even on discrete polytrees, the problems become wildly intractable! The chain is probably the only graphical model we can hope to solve optimally
- Our algorithms improve prediction accuracy
- They provide a viable optimal approach for a wide range of value-of-information tasks