Presentation transcript:

Carnegie Mellon. Algorithms for Answering Queries with Graphical Models. Thesis Proposal, Anton Chechetka, 21 May 2009. Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW).

Motivation: activity recognition, sensor networks, patient monitoring & diagnosis. Image credit: [Pentney+al:2006].

Motivation. Queries: What is the true temperature in a room? Has the person finished cooking? Is the patient well? Evidence: sensor 3 reads 25°C; the person is next to the kitchen sink (RFID); heart rate is 70 BPM. Common problem: compute P(Q | E = e).

Common solution. Common problem: compute P(Q | E = e) (the query). Common solution: probabilistic graphical models [Pentney+al:2006] [Deshpande+al:2004] [Beinlich+al:1988]. This thesis: new algorithms for learning and inference in PGMs to make answering queries better.

Graphical models. Graphical models represent factorized distributions P(X) ∝ ∏_α f_α(X_α), where the X_α are small subsets of X, giving a compact representation with a corresponding graph structure (e.g. over X1, ..., X5 in the slide's example). Fundamental problems:
- Learn/construct structure: what is the optimal structure, i.e. the sets X_α? (NP-complete)
- Learn/define parameters: what are the best parameters f_α given the structure? (exp(|X|) complexity in general)
- Inference: what is P(Q | E = e) given a PGM? (#P-complete / NP-complete)
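To make the factorization concrete, here is a minimal sketch in plain Python; the variables and factor values are illustrative, not taken from the thesis. It shows an unnormalized product of small factors and the brute-force marginalization whose cost grows as exp(|X|), which is exactly what the algorithms below avoid.

```python
import itertools

# Hypothetical pairwise factors f_alpha over small subsets X_alpha of X = {0,...,4}.
# Each factor maps an assignment of its variables to a nonnegative number.
factors = {
    (0, 1): lambda x0, x1: 2.0 if x0 == x1 else 1.0,
    (1, 2): lambda x1, x2: 3.0 if x1 == x2 else 1.0,
    (2, 3): lambda x2, x3: 1.5 if x2 != x3 else 1.0,
    (3, 4): lambda x3, x4: 2.5 if x3 == x4 else 1.0,
}

def unnormalized_p(assignment):
    """Product of all factors evaluated on a full assignment (tuple of 0/1)."""
    p = 1.0
    for scope, f in factors.items():
        p *= f(*(assignment[i] for i in scope))
    return p

# Brute-force marginal P(X0 = 1): sums over all 2^|X| assignments.
n = 5
z = sum(unnormalized_p(a) for a in itertools.product((0, 1), repeat=n))
p_x0 = sum(unnormalized_p(a)
           for a in itertools.product((0, 1), repeat=n) if a[0] == 1) / z
print(p_x0)
```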

This thesis. Pipeline: learn/construct structure, learn/define parameters, inference for P(Q | E = e). Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments

Learning tractable models. Every step in the pipeline (structure, parameters, inference) is computationally hard for general PGMs, and errors compound. But there are exact inference and parameter learning algorithms with exp(treewidth) complexity, so if we learn low-treewidth models, all the rest is easy.

Treewidth. Learn low-treewidth models and all the rest is easy. Treewidth is the size of the largest clique in a triangulated graph (minus one, under the standard convention). Computing treewidth is NP-complete in general, but it is easy to construct graphs with a given treewidth. Convenient representation: a junction tree, e.g. cliques C1 = {X1, X2, X7}, C2 = {X1, X2, X5}, C3 = {X1, X3, X5}, C4 = {X1, X4, X5}, C5 = {X4, X5, X6}, connected by separators {X1, X2}, {X1, X5}, {X1, X5}, {X4, X5}.

Junction trees. Learn junction trees and all the rest is easy. (Other classes of tractable models exist, e.g. [Lowd+Domingos:2008].) A junction tree must satisfy the running intersection property: the cliques containing any given variable form a connected subtree (in the example, the cliques containing X5, namely {X1, X2, X5}, {X1, X3, X5}, {X1, X4, X5}, {X4, X5, X6}, are contiguous in the tree). Finding the most likely junction tree of fixed treewidth > 1 is NP-complete, so we will look for good approximations.
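As a sanity check, the running intersection property can be verified mechanically on the example above. A minimal sketch follows; the tree edges are reconstructed from the separators listed on the treewidth slide and are my reading of the figure, not code from the thesis.

```python
from collections import defaultdict

# Cliques of the example junction tree; tree edges reconstructed from the separators.
cliques = {
    "C1": {"X1", "X2", "X7"},
    "C2": {"X1", "X2", "X5"},
    "C3": {"X1", "X3", "X5"},
    "C4": {"X1", "X4", "X5"},
    "C5": {"X4", "X5", "X6"},
}
tree_edges = [("C1", "C2"), ("C2", "C3"), ("C2", "C4"), ("C4", "C5")]

def connected(nodes, edges):
    """Is the induced subgraph on `nodes` connected?"""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        stack.extend(adj[c] - seen)
    return seen == nodes

# Running intersection property: for every variable, the cliques containing it
# must form a connected subtree of the junction tree.
for v in set().union(*cliques.values()):
    containing = [c for c, scope in cliques.items() if v in scope]
    assert connected(containing, tree_edges), f"RIP violated for {v}"
print("running intersection property holds")
```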

Independencies in low-treewidth distributions. If P(X) factorizes according to a junction tree, then the corresponding conditional independencies hold: conditioning on a separator makes the two sides of the tree conditionally independent, i.e. their conditional mutual information is (near) zero. In the example, conditioning on the separator {X1, X5} separates X_A = {X2, X3, X7} from X_B = {X4, X6}. The implication works in the other direction too: if these conditional independencies hold, a good junction tree exists.

Constraint-based structure learning. We will look for junction trees where this property holds. For every candidate separator S (S1, S2, S3, S4, ...): partition the remaining variables into weakly dependent subsets, i.e. subsets V with I(V, X \ (V ∪ S) | S) < ε (written I(V, X−VS | S) below). Then construct a junction tree consistent with these partitions, e.g. using dynamic programming. A sketch of this loop is given below.
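Here is a high-level sketch of that procedure. Everything passed in as a callback (estimate_cmi_upper_bound, partition_weakly_dependent, assemble_junction_tree) is a placeholder standing in for the thesis's subroutines; only the control flow mirrors the slide.

```python
import itertools

def learn_junction_tree(variables, data, k, eps,
                        estimate_cmi_upper_bound,
                        partition_weakly_dependent,
                        assemble_junction_tree):
    """Constraint-based thin junction tree learning, schematically.

    For every candidate separator S of size k, partition the remaining
    variables into weakly dependent components (I(V, rest | S) < eps),
    then hand the resulting constraints to a dynamic-programming step
    that assembles a consistent junction tree.
    """
    constraints = []  # list of (separator, partition into components)
    for S in itertools.combinations(variables, k):
        rest = [v for v in variables if v not in S]
        # Components V whose (upper-bounded) I(V, rest \ V | S) is below eps.
        components = partition_weakly_dependent(rest, S, data, eps,
                                                estimate_cmi_upper_bound)
        constraints.append((set(S), components))
    return assemble_junction_tree(variables, constraints, k)
```

The exhaustive loop over separators is what the "Speeding things up" slides at the end of the deck relax.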

Mutual information estimation. How do we test I(V, X−VS | S) < ε? By definition, I(A, B | S) = H(A | S) − H(A | B, S). Naïve estimation requires a sum over all 2^|X| assignments to X, which costs exp(|X|) and is too expensive. Our work: an upper bound on I(V, X−VS | S) that uses only the values I(Y, Z | S) for |Y ∪ Z| ≤ treewidth + 1. There are O(|X|^(treewidth+1)) such subsets Y and Z, so the complexity is polynomial in |X|.
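A minimal sketch of the ingredients: a plug-in estimate of conditional mutual information from counts, and the maximum over small subsets that the theorem on the next slide turns into a bound on the intractable quantity I(V, X−VS | S). This is an assumed empirical estimator for illustration, not the thesis's exact estimator.

```python
import itertools
import math
from collections import Counter

def empirical_cmi(samples, A, B, S):
    """Plug-in estimate of I(A; B | S) from a list of assignments (dicts)."""
    def proj(x, vars_):
        return tuple(x[v] for v in vars_)
    n = len(samples)
    c_abs = Counter(proj(x, A + B + S) for x in samples)
    c_as = Counter(proj(x, A + S) for x in samples)
    c_bs = Counter(proj(x, B + S) for x in samples)
    c_s = Counter(proj(x, S) for x in samples)
    mi = 0.0
    for x in samples:  # sum over observed assignments, each weighted by 1/n
        a, b, s = proj(x, A), proj(x, B), proj(x, S)
        p_abs = c_abs[a + b + s] / n
        mi += (1.0 / n) * math.log(
            (p_abs * (c_s[s] / n)) / ((c_as[a + s] / n) * (c_bs[b + s] / n)))
    return mi

def small_subset_bound(samples, V, rest, S, k):
    """max over A in V, B in rest with |A| + |B| <= k + 1 of I(A; B | S):
    the quantity delta that the theorem combines into a bound on I(V, rest | S)."""
    best = 0.0
    for size_a in range(1, k + 1):
        for A in itertools.combinations(V, size_a):
            for size_b in range(1, k + 2 - size_a):
                for B in itertools.combinations(rest, size_b):
                    best = max(best,
                               empirical_cmi(samples, list(A), list(B), list(S)))
    return best
```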

Mutual information estimation (continued). Theorem: suppose P(X), S, and V are such that an ε-junction tree of treewidth k for P(X) exists, and that for every A ⊆ V and B ⊆ X−VS with |A ∪ B| ≤ k + 1 we have I(A, B | S) ≤ δ. Then I(V, X−VS | S) ≤ |X|(δ + ε). The hard quantity I(V, X−VS | S) is bounded using only easy terms I(A, B | S) with |A ∪ B| ≤ treewidth + 1, giving complexity O(|X|^(k+1)), an exponential speedup. There is no need to know the ε-JT, only that it exists, and the bound is loose only when there is no hope of learning a good junction tree anyway.

Guarantees on learned model quality. Theorem: suppose P(X) is such that a strongly connected ε-junction tree of treewidth k for P(X) exists. Then, using polynomially many samples and polynomial time, our algorithm will, with probability at least 1 − δ, find a junction tree (C, E) satisfying the stated quality guarantee. Corollary: strongly connected junction trees are PAC-learnable.

Related work.

Ref.                        Model      Guarantees    Time
[Bach+Jordan:2002]          tractable  local         poly(n)
[Chow+Liu:1968]             tree       global        O(n^2 log n)
[Meila+Jordan:2001]         tree mix   local         O(n^2 log n)
[Teyssier+Koller:2005]      compact    local         poly(n)
[Singh+Moore:2005]          all        global        exp(n)
[Karger+Srebro:2001]        tractable  const-factor  poly(n)
[Abbeel+al:2006]            compact    PAC           poly(n)
[Narasimhan+Bilmes:2004]    tractable  PAC           exp(n)
our work                    tractable  PAC           poly(n)

Results: typical convergence time. (Plot of test log-likelihood versus time.) Good results early on in practice.

Results: test log-likelihood (higher is better). Our method compared against: OBS (local search in limited in-degree Bayes nets), Chow-Liu (most likely junction trees of treewidth 1), Karger-Srebro (constant-factor approximation junction trees).

This thesis. Pipeline: learn/construct structure, learn/define parameters, inference for P(Q | E = e). Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments

Approximate inference is still useful. Often learning a tractable graphical model is not an option: the structure may come from domain knowledge, or from templatized models such as Markov logic networks, probabilistic relational models, and dynamic Bayesian networks. In this part, the (intractable) PGM is a given. What can we do with the inference, and what if we know the query variables Q and the evidence E = e?

Query-specific simplification. In this part, the (intractable) PGM is a given; suppose we know the variables Q of interest (the query). Observation: often many variables are unknown but also not important to the user. Observation: usually, variables far away from the query have little effect on P(Q).

Query-specific simplification (continued). Observation: variables far away from the query do not affect P(Q) much. Idea: discard the parts of the model that have little effect on the query (the part near the query is the part we want first). Observation: the values of the potentials matter, not just graph distance. Our work: edge importance measures derived from the values of the potentials, efficient algorithms for model simplification, and focused inference as soft model simplification.

Belief propagation [Pearl:1988]. For every edge X_i - X_j and every value of X_j there is a message m_{i→j}(x_j); the belief about the marginal over X_i is proportional to the product of the incoming messages (times the local potential). Algorithm: update messages until convergence; a fixed point of the BP update operator is the solution.
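The simplification operates on standard sum-product loopy BP. The sketch below is a textbook version for a pairwise model with an assumed plain-dict data layout; it is not the thesis code, only the update that the later slides prune and reschedule.

```python
from math import prod

def loopy_bp(nodes, states, edge_potentials, node_potentials, n_iters=50):
    """Synchronous sum-product loopy BP on a pairwise model (a sketch).

    edge_potentials[(i, j)][(xi, xj)] and node_potentials[i][xi] are dicts;
    messages m[(i, j)][xj] are normalized after every update.
    """
    neighbors = {i: set() for i in nodes}
    for i, j in edge_potentials:
        neighbors[i].add(j)
        neighbors[j].add(i)

    def pot(i, j, xi, xj):
        # Edge potentials are stored once per undirected edge.
        if (i, j) in edge_potentials:
            return edge_potentials[(i, j)][(xi, xj)]
        return edge_potentials[(j, i)][(xj, xi)]

    m = {(i, j): {x: 1.0 for x in states} for i in nodes for j in neighbors[i]}

    for _ in range(n_iters):
        new_m = {}
        for (i, j) in m:
            out = {xj: sum(node_potentials[i][xi] * pot(i, j, xi, xj) *
                           prod(m[(k, i)][xi] for k in neighbors[i] if k != j)
                           for xi in states)
                   for xj in states}
            z = sum(out.values())
            new_m[(i, j)] = {x: v / z for x, v in out.items()}
        m = new_m

    # Belief about the marginal of X_i: local potential times incoming messages.
    beliefs = {}
    for i in nodes:
        b = {xi: node_potentials[i][xi] *
                 prod(m[(k, i)][xi] for k in neighbors[i]) for xi in states}
        z = sum(b.values())
        beliefs[i] = {x: v / z for x, v in b.items()}
    return beliefs
```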

Model simplification problem: which message updates can we skip so that the inference cost becomes small enough, while the BP fixed point for P(Q) does not change much?

Edge costs. Inference cost IC(i→j): the complexity of one BP update for m_{i→j}. Approximation value AV(i→j): a measure of the influence of m_{i→j} on the belief over the query, P(Q). Model simplification problem: find the set E' ⊆ E of edges that maximizes Σ AV(i→j) (fit quality) subject to Σ IC(i→j) ≤ inference budget (keep inference affordable). Lemma: the model simplification problem is NP-hard; greedy edge selection gives a constant-factor approximation.
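The budgeted trade-off has the shape of a knapsack-style selection. Here is a sketch of the kind of greedy value-per-cost rule the slide alludes to; it is my illustration of that family of heuristics, not the thesis's exact procedure.

```python
def greedy_simplify(edges, approx_value, inference_cost, budget):
    """Pick edges by value-per-cost until the inference budget is exhausted.

    edges: iterable of (i, j) pairs; approx_value / inference_cost: dicts
    giving AV(i->j) and IC(i->j). Returns the kept edge set E'.
    """
    ranked = sorted(edges,
                    key=lambda e: approx_value[e] / inference_cost[e],
                    reverse=True)
    kept, spent = [], 0.0
    for e in ranked:
        if spent + inference_cost[e] <= budget:
            kept.append(e)
            spent += inference_cost[e]
    return kept
```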

Approximation values. How important is an edge (i→j)? AV(i→j) should measure the influence of m_{i→j} on the belief P(Q). Consider a simple path ρ from an edge (v→n) to an edge (r→q) into the query: fixing all messages not on ρ, the endpoint message is a composition of BP updates along the path, m_{r→q} = BP*_ρ(m_{v→n}). Define path strength(ρ) from the derivative of this dependence, and define the max-sensitivity approximation value AV(i→j) = max over simple paths ρ containing (i→j) of path strength(ρ): the single strongest dependency (in terms of derivative) that the edge (i→j) participates in.

Efficient model simplification. The max-sensitivity approximation value AV(i→j) is the single strongest dependency (in derivative) that (i→j) participates in. Lemma: with max-sensitivity edge values, the optimal submodel can be found as the first M edges expanded by best-first search, with constant-time computation per expanded edge (using [Mooij+Kappen:2007]). For templated models, only the model parts that end up in the solution need to be instantiated, so the simplification complexity is independent of the size of the full model and depends only on the solution size.
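A schematic of that best-first expansion outward from the query follows. The sensitivity callback and the multiplicative path-strength bookkeeping are my reconstruction of the idea (the real algorithm gets constant work per expansion via [Mooij+Kappen:2007]); everything named here is an assumption, not the thesis's API.

```python
import heapq

def best_first_submodel(query_nodes, neighbors, sensitivity, max_edges):
    """Best-first expansion from the query by (approximate) path strength.

    sensitivity(k, i): local bound on how much message k->i can move the next
    message on the path. The priority of an edge is the product of sensitivities
    along the strongest path found so far back to the query, mirroring the
    max-sensitivity AV on the slide. Returns the first max_edges edges expanded.
    """
    frontier = []
    for q in query_nodes:
        for j in neighbors[q]:
            heapq.heappush(frontier, (-sensitivity(j, q), (j, q)))
    selected, seen = [], set()
    while frontier and len(selected) < max_edges:
        neg_strength, (i, j) = heapq.heappop(frontier)
        if (i, j) in seen:
            continue
        seen.add((i, j))
        selected.append((i, j))
        for k in neighbors[i]:
            if (k, i) not in seen:
                # Path strength multiplies along the path; with sensitivities
                # <= 1 it only shrinks, so the pop order finds the max path.
                heapq.heappush(frontier,
                               (neg_strength * sensitivity(k, i), (k, i)))
    return selected
```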

Future work: multi-path dependencies. An edge (i→j) may influence the query along several paths to (r→q), and we want to take all of them into account. Considering all paths is possible but expensive: O(|E|^3). Alternatives: use the k strongest paths, AV(i→j) = Σ_{m=1..k} path strength(ρ_m) over the k strongest simple paths ρ_1, ..., ρ_k through (i→j), or run best-first search with at most k visits per edge.

Perturbation approximation values. Observation: path strength(ρ) is the largest derivative value along the path with respect to the endpoint message, but it does not take the possible range of the endpoint message into account. As before, take a simple path ρ from (v→n) to (r→q) into the query and fix all messages not on ρ, so that m_{r→q} = BP*_ρ(m_{v→n}). Define path strength*(ρ) as an upper bound on the change in m_{r→q}, obtained via the mean value theorem; a tighter bound follows from properties of BP messages.

Efficient model simplification with max-perturbation values. Define the max-perturbation value AV(i→j) = max over paths ρ containing (i→j) of path strength*(ρ). Lemma: with max-perturbation edge values, assuming the message derivatives along the paths ρ are known, the optimal submodel can be found as the first M edges expanded by best-first search, with constant-time computation per expanded edge. The extra work is knowing the derivatives along the paths ρ; the solution is to use max-sensitivity best-first search as a subroutine.

Future work: efficient max-perturbation simplification. The extra work is knowing the derivatives along the paths ρ (path strength*(ρ) depends on the norms ||f|| along the path). Best-first search maintains a current lower bound on path strength*, so the exact derivative is only needed when the cheap norm-based bound leaves AV(i→j) undecided, i.e. when the derivative may fall in the range that could change the answer, and that is not always the case.

Future work: computation trees. A computation tree traversal corresponds to a message update schedule; the idea is to prune computation trees according to edge importance.

Focused inference. BP normally proceeds until all beliefs converge, but we only care about the query beliefs: convergence near the query matters more than convergence far from it. Ideas: residual importance weighting for convergence testing, and, for residual BP, weighting residuals by edge importance so that more attention goes to the more important regions.
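In that spirit, focused inference amounts to scheduling message updates by importance-weighted residuals. The sketch below is schematic: compute_message and edge_importance are assumed callbacks, and the lazy re-queueing is a simplification of real residual BP schedulers, not the thesis implementation.

```python
import heapq

def importance_weighted_residual_bp(edges, messages, compute_message,
                                    edge_importance, tol=1e-4,
                                    max_updates=100000):
    """Residual BP where the scheduling priority is importance * residual.

    messages[(i, j)]: current message as a dict state -> float;
    compute_message(i, j, messages): recomputed value of that message.
    """
    def residual(old, new):
        return max(abs(old[x] - new[x]) for x in old)

    # Prime the queue with every edge's weighted residual.
    heap = []
    for (i, j) in edges:
        r = residual(messages[(i, j)], compute_message(i, j, messages))
        heapq.heappush(heap, (-edge_importance(i, j) * r, (i, j)))

    for _ in range(max_updates):
        if not heap:
            break
        neg_priority, (i, j) = heapq.heappop(heap)
        if -neg_priority < tol:
            break  # convergence is judged on weighted residuals, so
                   # low-importance edges may be left unconverged
        messages[(i, j)] = compute_message(i, j, messages)
        # Messages leaving j now have stale residuals; re-queue them lazily.
        for (a, b) in edges:
            if a == j:
                r = residual(messages[(a, b)],
                             compute_message(a, b, messages))
                heapq.heappush(heap, (-edge_importance(a, b) * r, (a, b)))
    return messages
```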

Related work. Minimal submodels that preserve P(Q | E = e) exactly regardless of the values of the potentials: knowledge-based model construction [Wellman+al:1992, Richardson+Domingos:2006]. Graph distance as an edge importance measure [Pentney+al:2006]. Empirical mutual information as a variable importance measure [Pentney+al:2007]. Inference in a simplified model to quantify the effect of an extra edge exactly [Kjaerulff:1993, Choi+Darwiche:2008].

This thesis. Pipeline: learn/construct structure, learn/define parameters, inference for P(Q | E = e). Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments

Local models: motivation. Common approach: learn/construct a structure, approximate the parameters, and run approximate inference to get P(Q | E = e). This talk, part 1: learn a tractable structure, fit optimal parameters, and run exact inference. But what if no single tractable structure fits well?

Local models: motivation (continued). What if no single tractable structure fits well? Regression analogy: with evidence e on one axis and query q = f(e) on the other, no single line fits well, but locally the dependence is almost linear. Solution: learn local tractable models. Instead of "learn tractable structure, fit optimal parameters, exact inference", the pipeline becomes "get the evidence assignment E = e, learn a tractable structure for E = e, fit parameters for E = e, run exact inference for P(Q | E = e)".

Local models: example. Local conditional random fields (CRFs). A global CRF defines P(Q | E) with one set of features and weights shared across all evidence assignments E = e_1, e_2, ..., e_n. A local CRF additionally has a query-specific structure: indicator functions I_α(E) ∈ {0, 1} select, for each evidence assignment, which features participate. At query time: get the evidence assignment E = e, use the query-specific structure for E = e together with the shared parameters, and run exact inference for P(Q | E = e).

Learning local models. For a local CRF we need to learn both the weights w and the query-specific structure I(E). Iterate: (1) with known structures for every training point (E = e_1, Q = q_1), ..., (E = e_n, Q = q_n), the optimal weights w follow from convex optimization; (2) with known weights w, find good local structures for each training point (e.g. by local search). The catch: step (2) needs the query values, so it cannot be used at test time.

Learning local models (continued). Fix: parametrize the structure selector by V, i.e. I_α = I(E, V), and learn w together with the structure parameters V. Iterate: (1) with known structures for every training point, find the optimal weights w (convex optimization); (2) with known weights w, find good local structures for each training point (e.g. local search); (3) optimize V so that I(E, V) mimics those good local structures on the training data. At test time the structure comes from I(E = e, V), which requires no query values. A sketch of this alternation is given below.
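Here is a schematic of that alternating scheme. All subroutine names (fit_weights, local_structure_search, fit_structure_predictor, predict_structure) are placeholders for the thesis's components: convex weight learning, per-datapoint structure search, and fitting the structure predictor I(E, V).

```python
def learn_local_models(train_data, init_weights, init_V,
                       fit_weights, local_structure_search,
                       fit_structure_predictor, predict_structure,
                       n_rounds=10):
    """Alternate between weights w, per-example structures, and predictor V.

    train_data: list of (evidence e, query q) pairs.
    predict_structure(e, V) plays the role of I(E, V) at train and test time.
    """
    w, V = init_weights, init_V
    for _ in range(n_rounds):
        # 1. With w fixed, search for a good tractable structure per example.
        #    This step may peek at q, which is why it cannot run at test time.
        structures = [local_structure_search(e, q, w) for e, q in train_data]
        # 2. Fit V so that I(E, V) mimics those structures on the training set.
        V = fit_structure_predictor([e for e, _ in train_data], structures, V)
        # 3. With structures now given by I(E, V), weight learning is convex.
        predicted = [predict_structure(e, V) for e, _ in train_data]
        w = fit_weights(train_data, predicted, w)
    return w, V

# At test time only the evidence is available: pick the structure with
# predict_structure(e_test, V), then answer P(Q | E = e_test) by exact
# inference in that tractable structure with weights w.
```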

Future work: better exploration. The alternating scheme (known structures give optimal weights w by convex optimization; known weights give good local structures by local search) needs to avoid shallow local minima. Ideas: multiple structures per datapoint, or stochastic optimization that samples structures. Open question: will the sampled structures actually be different?

Future work: multi-query optimization. A separate structure for every query may be too costly. Idea: query clustering, either directly on the evidence or using the inferred model parameters (given w and V), so that the step "optimize V so that I(E, V) mimics the good local structures on the training data" is shared across similar queries.

Future work: faster local search. The structure-search step of the alternation needs to be efficient: amortize inference cost when scoring multiple search steps, and add support for nuisance variables (variables that are neither query nor evidence) in the structure scores.

Recap. Pipeline: learn/construct structure, learn/define parameters, inference for P(Q | E = e). Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments

Timeline (Summer 2009 through Summer 2010).
- Validation of QS model simplification: activity recognition data, MLN data.
- QS simplification: multi-path extensions for edge importance measures; connections to computation trees; max-perturbation computation speedups.
- QS learning: better exploration (stochastic optimization / multiple structures per datapoint); multi-query optimization; validation.
- QS learning: nuisance variable support; local search speedups; quality guarantees; validation.
- Write thesis, defend.

Thank you! Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf.

Speeding things up. Constraint-based algorithm:
1. Set L = ∅.
2. For every potential separator S ⊂ X with |S| = k:
3.   estimate I(·), updating L.
4. Find a junction tree (C, E) consistent with L.
There are O(|X|^k) separators in step 2, but only |X| − k separators appear in (C, E), so the I(·) estimates for the rest of the O(|X|^k) separators are wasted. Faster heuristic: repeat "estimate I(·), updating L; find a junction tree (C, E) consistent with L" until (C, E) passes the checks.

Speeding things up (continued). Recall that our upper bound on I(V, X−VS | S) uses all subsets Y ⊆ X \ S with |Y| ≤ k (each Y split as Y ∩ V and Y ∩ X−VS). Idea: get a rough estimate by only looking at smaller subsets Y (e.g. |Y| = 2). Refined heuristic:
1. Estimate I(·) with |Y| = 2 and form L.
2. Repeat:
3.   find a junction tree (C, E) consistent with L;
4.   for the separators S_α used by (C, E), estimate I(· | S_α) with |Y| = k and update L;
5.   check whether (C, E) is still an ε-JT under the updated I(· | S_α);
6. until (C, E) passes the checks.
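A compact sketch of that lazy refinement loop follows; cheap_bound, tight_bound, build_jt, and is_eps_jt are placeholders (not the thesis's API), and the junction tree is assumed to be returned as a clique dict C and edge set E.

```python
import itertools

def lazy_learn_jt(variables, data, k, eps,
                  cheap_bound, tight_bound, build_jt, is_eps_jt):
    """Lazy variant: cheap |Y| = 2 bounds everywhere, tight |Y| <= k bounds
    only for the separators the candidate junction tree actually uses."""
    L = {S: cheap_bound(S, data)
         for S in map(frozenset, itertools.combinations(variables, k))}
    while True:
        C, E = build_jt(variables, L, k)             # tree consistent with current L
        used = {frozenset(C[a] & C[b]) for a, b in E}  # only |X| - k separators
        for S in used:
            L[S] = tight_bound(S, data, k)           # refine just the separators used
        if is_eps_jt(C, E, L, eps):                  # still an eps-JT? then accept
            return C, E
```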