Algorithms for Answering Queries with Graphical Models
Thesis Proposal, 21 May 2009
Anton Chechetka, Carnegie Mellon
Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW)
Motivation
- Activity recognition
- Sensor networks
- Patient monitoring & diagnosis
Image credit: [Pentney+al:2006]
Motivation
Queries: True temperature in a room? Has the person finished cooking? Is the patient well?
Evidence: Sensor 3 reads 25°C. The person is next to the kitchen sink (RFID). Heart rate is 70 BPM.
Common problem: compute P(Q | E=e)
Common solution
Common problem: compute P(Q | E=e) (query)
Common solution: probabilistic graphical models (PGMs) [Pentney+al:2006] [Deshpande+al:2004] [Beinlich+al:1988]
This thesis: new algorithms for learning and inference in PGMs to make answering queries better
Graphical models
Represent factorized distributions P(X) = (1/Z) ∏_α f_α(X_α), where X_α are small subsets of X: a compact representation with a corresponding graph structure.
[Figure: example graph over X1, ..., X5]
Fundamental problems:
- P(Q|E=e) given a PGM? (inference: #P-complete / NP-complete)
- Best parameters f_α given the structure? (exp(|X|) complexity)
- Optimal structure, i.e. sets X_α? (NP-complete)
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
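To make the query P(Q | E=e) concrete, here is a minimal brute-force sketch in Python. It is an illustration only, not the thesis code: the function name `query`, the factor representation, and the three-variable chain are all assumptions made for this example. It enumerates every assignment, which is exactly the exp(|X|) cost the slide warns about.

```python
from itertools import product

def query(factors, cardinality, q_vars, evidence):
    """Brute-force P(Q | E=e) for a distribution given as a product of factors.

    factors: list of (scope, table); scope is a tuple of variable names and
             table maps a tuple of values (in scope order) to a non-negative number.
    cardinality: dict variable -> number of values.
    """
    variables = sorted(cardinality)
    scores = {}
    # Enumerate all joint assignments: exponential in the number of variables.
    for values in product(*(range(cardinality[v]) for v in variables)):
        x = dict(zip(variables, values))
        if any(x[v] != val for v, val in evidence.items()):
            continue  # inconsistent with the evidence E=e
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(x[v] for v in scope)]
        key = tuple(x[v] for v in q_vars)
        scores[key] = scores.get(key, 0.0) + weight
    total = sum(scores.values())  # normalization over the query values
    return {k: s / total for k, s in scores.items()}

# Hypothetical chain X1 - X2 - X3 with "neighbors tend to agree" factors.
agree = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
factors = [(("X1", "X2"), agree), (("X2", "X3"), agree)]
cardinality = {"X1": 2, "X2": 2, "X3": 2}
posterior = query(factors, cardinality, ("X1",), {"X3": 1})
```

Here `posterior` holds P(X1 | X3=1); the normalization constant Z never has to be computed separately because it cancels in the conditional.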
This thesis
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Learning tractable models
Every step in the pipeline (learn/construct structure, learn/define parameters, inference P(Q|E=e)) is computationally hard for general PGMs, and the errors compound.
But there are exact inference and parameter learning algorithms with exp(graph treewidth) complexity.
So if we learn low-treewidth models, all the rest is easy!
Treewidth
Learn low-treewidth models, and all the rest is easy!
- Treewidth = (size of the largest clique in a triangulated graph) - 1, minimized over triangulations
- Computing treewidth is NP-complete in general
- But it is easy to construct graphs with a given treewidth
- Convenient representation: junction tree
[Figure: junction tree with cliques C1, ..., C5 over {X1,X2,X7}, {X1,X2,X5}, {X1,X4,X5}, {X4,X5,X6}, {X1,X3,X5} and separators {X1,X2}, {X1,X5}, {X1,X5}, {X4,X5}]
Junction trees
Learn junction trees, and all the rest is easy!
- Other classes of tractable models exist, e.g. [Lowd+Domingos:2008]
- Running intersection property: every variable shared by two cliques also appears in all cliques on the path between them
- Finding the most likely junction tree of fixed treewidth > 1 is NP-complete
- We will look for good approximations
[Figure: the junction tree with cliques C1, ..., C5 over {X1,X2,X7}, {X1,X2,X5}, {X1,X4,X5}, {X4,X5,X6}, {X1,X3,X5}]
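The running intersection property can be checked mechanically. The sketch below is an illustration under assumptions: the function name is made up, and the tree topology (which cliques are adjacent) is a guess consistent with the separators listed on the slide, since the figure itself is not recoverable. For each variable it verifies that the cliques containing it form a connected subtree.

```python
def has_running_intersection(cliques, edges):
    """cliques: dict name -> set of variables; edges: list of (name, name) forming a tree.

    Returns True iff, for every variable, the cliques containing it are connected
    using only tree edges between cliques that also contain it.
    """
    adjacency = {c: set() for c in cliques}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for var in set().union(*cliques.values()):
        holders = {c for c, members in cliques.items() if var in members}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:  # depth-first search restricted to cliques that hold var
            node = stack.pop()
            for nb in adjacency[node]:
                if nb in holders and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if seen != holders:
            return False
    return True

# Cliques from the slide; the clique names A..F and the edges are assumed.
cliques = {
    "A": {"X1", "X2", "X7"}, "B": {"X1", "X2", "X5"}, "C": {"X1", "X4", "X5"},
    "D": {"X4", "X5", "X6"}, "F": {"X1", "X3", "X5"},
}
good_tree = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "F")]
bad_tree = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "F")]  # F hangs off D, which lacks X1
good_ok = has_running_intersection(cliques, good_tree)
bad_ok = has_running_intersection(cliques, bad_tree)
```

In the second tree, X1 appears in clique F but not in its only neighbor D, so the property fails even though the clique set is unchanged.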
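Independencies in low-treewidth distributions
P(X) factorizes according to a JT ⇒ conditional independencies hold across every separator.
Example: the separator {X1,X5} splits the remaining variables into {X2,X3,X7} and {X4,X6}, which are conditionally independent given {X1,X5}.
Conditional mutual information works in the other way too: low conditional mutual information across every separator means that P(X) is well approximated by the JT.
[Figure: junction tree with cliques {X1,X2,X7}, {X1,X2,X5}, {X1,X4,X5}, {X4,X5,X6}, {X1,X3,X5}]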
Constraint-based structure learning
We will look for JTs where this holds.
1. Take all candidate separators S1, S2, S3, S4, …
2. For every separator S, partition the remaining variables X_{-S} into weakly dependent subsets: for a subset V, test I(V, X\VS | S) < ??
3. Construct a junction tree (e.g. using dynamic programming)
Mutual information estimation
I(V, X\VS | S) < ??
- Definition: I(A,B|S) = H(A|S) - H(A|BS)
- Naïve estimation costs exp(|X|) (a sum over all 2^|X| assignments to X): too expensive
- Our work: an upper bound on I(V, X\VS | S) using values of I(Y,Z|S) for |YZ| ≤ treewidth+1
- There are O(|X|^(treewidth+1)) subsets Y and Z, so the complexity is polynomial in |X|
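The definition I(A,B|S) = H(A|S) - H(A|BS) translates directly into code once the conditional entropies are written as differences of joint entropies. The sketch below is a hypothetical illustration, not the estimator from the thesis: it works on an explicit joint table rather than on samples, and the three-variable Markov chain is invented for the example.

```python
from collections import defaultdict
from itertools import product
from math import log

def entropy(joint, idx):
    """Entropy (in nats) of the marginal over the variable positions in idx."""
    marginal = defaultdict(float)
    for x, p in joint.items():
        marginal[tuple(x[i] for i in idx)] += p
    return -sum(p * log(p) for p in marginal.values() if p > 0)

def cond_mutual_info(joint, a, b, s):
    """I(A,B|S) = H(A|S) - H(A|BS), with H(U|W) = H(UW) - H(W)."""
    h_a_given_s = entropy(joint, a + s) - entropy(joint, s)
    h_a_given_bs = entropy(joint, a + b + s) - entropy(joint, b + s)
    return h_a_given_s - h_a_given_bs

# Hypothetical Markov chain X0 -> X1 -> X2 over binary variables.
joint = {}
for x0, x1, x2 in product((0, 1), repeat=3):
    p1 = 0.9 if x1 == x0 else 0.1
    p2 = 0.8 if x2 == x1 else 0.2
    joint[(x0, x1, x2)] = 0.5 * p1 * p2

i_given_middle = cond_mutual_info(joint, (0,), (2,), (1,))  # X1 separates X0 and X2
i_marginal = cond_mutual_info(joint, (0,), (2,), ())        # no conditioning
```

Conditioning on the middle variable drives the value to zero (up to rounding), while the unconditional mutual information stays clearly positive. The table has one entry per joint assignment, which is why applying this directly to large sets V and X\VS is infeasible and small subsets Y, Z are used instead.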
Mutual information estimation
I(V, X_{-VS} | S) = ??
Theorem: suppose that P(X), S, V are s.t.
- an ε-JT of treewidth k for P(X) exists
- for every A ⊆ V, B ⊆ X_{-VS} s.t. |AB| ≤ k+1, I(A,B|S) ≤ δ
Then I(V, X_{-VS} | S) ≤ |X|(ε + δ)
- Complexity O(|X|^(k+1)): an exponential speedup (the hard quantity I(V, X_{-VS} | S) is bounded via the easy ones I(A,B|S) with |AB| ≤ treewidth+1)
- No need to know the ε-JT, only that it exists
- The bound is loose only when there is no hope to learn a good JT
Guarantees on learned model quality
Theorem: suppose that P(X) is s.t. a strongly connected ε-JT of treewidth k for P(X) exists. Then our algorithm will, with probability at least (1-γ), find a JT (C,E) satisfying a quality guarantee, using polynomially many samples and polynomial time. [Exact bounds lost in extraction.]
Corollary: strongly connected junction trees are PAC-learnable.
Related work
Ref.                      Model      Guarantees    Time
[Bach+Jordan:2002]        tractable  local         poly(n)
[Chow+Liu:1968]           tree       global        O(n^2 log n)
[Meila+Jordan:2001]       tree mix   local         O(n^2 log n)
[Teyssier+Koller:2005]    compact    local         poly(n)
[Singh+Moore:2005]        all        global        exp(n)
[Karger+Srebro:2001]      tractable  const-factor  poly(n)
[Abbeel+al:2006]          compact    PAC           poly(n)
[Narasimhan+Bilmes:2004]  tractable  PAC           exp(n)
our work                  tractable  PAC           poly(n)
Results: typical convergence time
Good results early on in practice.
[Plot: test log-likelihood]
Results: log-likelihood
[Plot: log-likelihood of our method vs. baselines; higher is better]
Baselines:
- OBS: local search in limited in-degree Bayes nets
- Chow-Liu: most likely JTs of treewidth 1
- Karger-Srebro: constant-factor approximation JTs
This thesis
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Approximate inference is still useful
Often learning a tractable graphical model is not an option:
- Need domain knowledge
- Templatized models: Markov logic nets, probabilistic relational models, dynamic Bayesian nets
This part: the (intractable) PGM is a given.
- What can we do with the inference?
- What if we know the query variables Q and evidence E=e?
Query-specific simplification
This part: the (intractable) PGM is a given.
Suppose we know the variables Q of interest (the query).
- Observation: often many variables are unknown, but also not important to the user
- Observation: usually, variables far away from the query do not affect P(Q) much
[Figure: model graph with the query marked; annotation: "these have little effect on P(Q)"]
Query-specific simplification
- Observation: variables far away from the query do not affect P(Q) much
- Idea: discard parts of the model that have little effect on the query
- Observation: values of potentials are important
Our work:
- edge importance from values of potentials
- efficient algorithms for model simplification
- focused inference as soft model simplification
[Figure: model graph with the query marked; annotations: "these have little effect on P(Q)", "want this part first"]
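Belief propagation [Pearl:1988]
- For every edge X_i - X_j and variable, a message m_{i→j}
- Belief about the marginal over X_i: the normalized product of the messages coming into X_i
- Algorithm: until convergence, recompute every message from the current messages, m ← BP(m)
- A fixed point of BP(·) is the solution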
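The message-passing loop can be sketched as follows. The slide's own formulas did not survive extraction, so this uses the standard sum-product update for a pairwise model, m_{i→j}(x_j) ∝ Σ_{x_i} f_ij(x_i, x_j) f_i(x_i) ∏_{k≠j} m_{k→i}(x_i), which is an assumption about the intended variant. The function name, the data layout, the synchronous schedule and the fixed iteration count are all choices made for this illustration.

```python
def belief_propagation(unary, pairwise, iterations=50):
    """Synchronous sum-product BP on a pairwise model.

    unary: dict var -> list of potentials f_i(x_i)
    pairwise: dict (i, j) -> matrix with entry [x_i][x_j] = f_ij(x_i, x_j)
    """
    def pot(i, j, xi, xj):
        # Look the potential up in whichever orientation it was stored.
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    neighbors = {v: set() for v in unary}
    for i, j in pairwise:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # One message per directed edge, initialized to uniform.
    msgs = {(i, j): [1.0] * len(unary[j]) for i in unary for j in neighbors[i]}
    for _ in range(iterations):
        new = {}
        for (i, j) in msgs:
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    incoming = unary[i][xi]
                    for k in neighbors[i] - {j}:
                        incoming *= msgs[(k, i)][xi]
                    total += pot(i, j, xi, xj) * incoming
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]
        msgs = new  # m <- BP(m)
    beliefs = {}
    for i in unary:
        b = list(unary[i])
        for k in neighbors[i]:
            b = [bv * mv for bv, mv in zip(b, msgs[(k, i)])]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs

# Hypothetical chain X1 - X2 - X3; evidence X3=1 is encoded as a unary potential.
agree = [[2.0, 1.0], [1.0, 2.0]]
unary = {"X1": [1.0, 1.0], "X2": [1.0, 1.0], "X3": [0.0, 1.0]}
pairwise = {("X1", "X2"): agree, ("X2", "X3"): agree}
beliefs = belief_propagation(unary, pairwise)
```

On a tree such as this chain the fixed point gives exact marginals; on graphs with loops the same loop runs unchanged but the beliefs are only approximations.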
Model simplification problem
Which messages to skip updating s.t.
- inference cost gets small enough
- the BP fixed point for P(Q) does not change much
Edge costs
- Inference cost IC(i→j): complexity of one BP update for m_{i→j}
- Approximation value AV(i→j): measure of influence of m_{i→j} on the belief P(Q)
Model simplification problem: find the set E' ⊆ E of edges s.t.
- Σ_{(i→j) ∈ E'} AV(i→j) → max (maximize fit quality)
- Σ_{(i→j) ∈ E'} IC(i→j) ≤ inference budget (keep inference affordable)
Lemma: the model simplification problem is NP-hard.
Greedy edge selection gets a [?]-factor approximation.
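One plausible reading of "greedy edge selection" is the usual knapsack-style heuristic: rank edges by approximation value per unit of inference cost and add them while the budget allows. The slide does not spell out the greedy rule, so the ranking criterion below is an assumption, and the edge names, values and costs are invented for illustration.

```python
def greedy_edge_selection(edges, budget):
    """edges: dict edge -> (approximation_value, inference_cost), costs > 0.

    Returns the chosen edges (in selection order) and the total cost spent.
    """
    chosen, spent = [], 0.0
    # Assumed rule: highest value-to-cost ratio first.
    ranked = sorted(edges, key=lambda e: edges[e][0] / edges[e][1], reverse=True)
    for edge in ranked:
        cost = edges[edge][1]
        if spent + cost <= budget:  # skip edges that would exceed the budget
            chosen.append(edge)
            spent += cost
    return chosen, spent

# Hypothetical directed edges (i, j) with (AV, IC) pairs.
edges = {
    ("a", "q"): (0.9, 2.0),
    ("b", "a"): (0.5, 2.0),
    ("c", "b"): (0.1, 4.0),
    ("d", "q"): (0.3, 1.0),
}
chosen, spent = greedy_edge_selection(edges, budget=5.0)
```

With a budget of 5 the sketch keeps the three edges with the best ratios and drops the expensive, low-value edge ("c", "b").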
Approximation values
Approximation value AV(i→j): measure of influence of m_{i→j} on the belief P(Q). How important is (i→j)?
- Take a simple path π from an edge (v→n) to an edge (r→q) into the query, and fix all messages not in π. Then m_{r→q} = BP*(m_{v→n}).
- define: path strength(π) = the largest derivative of m_{r→q} w.r.t. m_{v→n}
- define: AV(i→j) = max_{π ∋ (i→j)} path strength(π)
The max-sensitivity approximation value AV(i→j) is the single strongest dependency (in derivative) that (i→j) participates in.
Efficient model simplification
The max-sensitivity approximation value AV(i→j) is the single strongest dependency (in derivative) that (i→j) participates in.
Lemma: with max-sensitivity edge values, we can find the optimal submodel
- as the first M edges expanded by best-first search
- with constant-time computation per expanded edge (using [Mooij+Kappen:2007])
Templated models: only instantiate model parts that are in the solution.
Simplification complexity is independent of the size of the full model (it only depends on the solution size).
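The best-first search in the lemma can be sketched as a Dijkstra-style expansion from the query. The sketch assumes that path strength is the product of per-edge factors in (0, 1], each a bound on how strongly one message depends on the previous one; that multiplicative form, the function name and the toy graph are assumptions of this illustration, not details confirmed by the slide. Because extending a path can never increase its strength, the first time an edge is popped its value is final.

```python
import heapq

def best_first_edges(neighbors, strength, query, max_edges):
    """Expand directed edges (i, j) in decreasing order of max-sensitivity value.

    neighbors: dict var -> set of neighboring vars
    strength: dict (i, j) -> per-edge factor in (0, 1] (assumed given)
    Returns the expansion order and the value assigned to each expanded edge.
    """
    # Start from the messages that flow directly into the query variable.
    heap = [(-strength[(r, query)], (r, query)) for r in neighbors[query]]
    heapq.heapify(heap)
    value, order = {}, []
    while heap and len(order) < max_edges:
        neg, (i, j) = heapq.heappop(heap)
        if (i, j) in value:
            continue  # already expanded with a value at least as large
        value[(i, j)] = -neg
        order.append((i, j))
        # m_{i->j} depends on the messages m_{k->i} for neighbors k other than j.
        for k in neighbors[i]:
            if k != j and (k, i) not in value:
                heapq.heappush(heap, (neg * strength[(k, i)], (k, i)))
    return order, value

# Hypothetical model: q - a - b and q - c.
neighbors = {"q": {"a", "c"}, "a": {"q", "b"}, "b": {"a"}, "c": {"q"}}
strength = {("a", "q"): 0.9, ("c", "q"): 0.2, ("b", "a"): 0.5}
order, value = best_first_edges(neighbors, strength, "q", max_edges=2)
```

Only edges that are actually pushed need a `strength` entry, which mirrors the point about templated models: parts of the model the search never reaches never have to be instantiated.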
Future work: multi-path dependencies
[Figure: two paths from edge (i→j) to the query edge (r→q)] Want to take both of these into account.
- All paths: possible, but expensive, O(|E|^3)
- k strongest paths? AV(i→j) = max_{π_1,…,π_k ∋ (i→j)} Σ_m path strength(π_m)
- Best-first search with at most k visits of an edge?
Perturbation approximation values
Observation: max-sensitivity values do not take the possible range of the endpoint message into account; path strength(π) is the largest derivative value along the path w.r.t. the endpoint message.
- Take a simple path π from (v→n) to (r→q) and fix all messages not in π: m_{r→q} = BP*(m_{v→n})
- define: path strength*(π) = upper bound on the change of m_{r→q}
- The bound follows from the mean value theorem; a tighter bound follows from BP message properties
Efficient model simplification
define: max-perturbation AV(i→j) = max_{π ∋ (i→j)} path strength*(π)
Lemma: with max-perturbation edge values, assuming that the message derivatives along paths are known, we can find the optimal submodel
- as the first M edges expanded by best-first search
- with constant-time computation per expanded edge
Extra work: need to know derivatives along paths. Solution: max-sensitivity best-first search as a subroutine.
Future work: efficient max-perturbation simplification
Extra work: need to know derivatives along paths. Not always!
- define: path strength*(π) = min(…, ||f…||) [formula garbled in the source]
- BFS maintains a current lower bound on path strength* for AV(i→j)
- Only need the exact derivative if the derivative is in the range where it matters for this bound
Future work: computation trees
- Computation tree traversal ~ message update schedule
- Prune computation trees according to edge importance?
Focused inference
BP proceeds until all beliefs converge, but we only care about query beliefs.
- Residual importance weighting for convergence testing
- For residual BP: more attention to more important regions
- Weigh residuals by edge importance
[Figure: model regions labeled "convergence here is more important" and "convergence here is less important"]
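A minimal sketch of importance-weighted scheduling, assuming each directed edge already has a residual (how much its message would change if recomputed) and an importance weight such as an approximation value. The function name and the numbers are hypothetical; the point is only that both the choice of the next update and the convergence test use the weighted residual.

```python
def next_update(residuals, importance, tolerance):
    """Pick the message to update next, or None if weighted residuals have converged.

    residuals: dict edge -> current residual of that message
    importance: dict edge -> importance weight of that edge for the query
    """
    weighted = {edge: importance[edge] * r for edge, r in residuals.items()}
    edge = max(weighted, key=weighted.get)
    if weighted[edge] < tolerance:
        return None  # every message is converged as far as the query is concerned
    return edge

# Hypothetical state: a large residual far from the query, a small one next to it.
residuals = {("b", "a"): 0.5, ("a", "q"): 0.1}
importance = {("b", "a"): 0.1, ("a", "q"): 0.9}
pick_loose = next_update(residuals, importance, tolerance=0.01)
pick_tight = next_update(residuals, importance, tolerance=0.1)
```

Plain residual BP would update ("b", "a") first because its raw residual is larger; with the weights, the edge into the query wins, and with the looser tolerance inference stops altogether.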
Related work
- Minimal submodel to have exactly the same P(Q|E=e) regardless of the values of potentials: knowledge-based model construction [Wellman+al:1992, Richardson+Domingos:2006]
- Graph distance as edge importance measure [Pentney+al:2006]
- Empirical mutual information as variable importance measure [Pentney+al:2007]
- Inference in the simplified model to quantify the effect of an extra edge exactly [Kjaerulff:1993, Choi+Darwiche:2008]
This thesis
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Local models: motivation
Common approach: learn/construct structure → approximate parameters → approximate inference P(Q|E=e)
This talk, part 1: learn tractable structure → optimal parameters → exact inference P(Q|E=e)
What if no single tractable structure fits well?
Local models: motivation
What if no single tractable structure fits well?
Regression analogy: query q = f(e) as a function of evidence e. No single line fits well, but the dependence is locally almost linear.
Solution: learn local tractable models.
- Before: learn tractable structure → optimal parameters → exact inference P(Q|E=e)
- Now: get evidence assignment E=e → learn tractable structure for E=e → parameters for E=e → exact inference P(Q|E=e)
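Local models: example
Pipeline: get evidence assignment E=e → learn tractable structure for E=e → parameters for E=e → exact inference P(Q|E=e)
Example: local conditional random fields (CRFs)
- Global CRF: one structure with (feature, weight) pairs shared by all evidence assignments E=e_1, E=e_2, …, E=e_n
- Local CRF: the same features and weights plus a query-specific structure I(E) ∈ {0, 1}, so each of E=e_1, E=e_2, …, E=e_n gets its own structure
[Formulas for the global and local CRF lost in extraction]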
Learning local models
Local CRF with query-specific (QS) structure I(E) ∈ {0, 1}. Need to learn the weights w and the QS structure I(E).
Iterate:
- known structures for every training point (E=e_1, Q=q_1), …, (E=e_n, Q=q_n) → optimal weights w (convex optimization)
- known weights w + training points (E=e_1, Q=q_1), …, (E=e_n, Q=q_n) → good local structures for E=e_1, …, E=e_n (e.g. local search)
Problem: the structure search needs the query values, so it cannot be used at test time.
Learning local models
Parametrize I(E) by V: I = I(E, V). Learn w and the QS structure parameters V.
Iterate:
- known structures for every training point (E=e_1, Q=q_1), …, (E=e_n, Q=q_n) → optimal weights w (convex optimization)
- known weights w + training points → good local structures for E=e_1, …, E=e_n (e.g. local search)
- optimize V so that I(E, V) mimics the good local structures well on the training data: I(E=e_1, V), …, I(E=e_n, V)
Future work: better exploration
Iteration: known structures for every training point → optimal weights w (convex optimization); known weights w → good local structures (e.g. local search).
Need to avoid shallow local minima:
- multiple structures per datapoint
- stochastic optimization: sample structures (will these be different?)
Future work: multi-query optimization
A separate structure for every query may be too costly.
Query clustering:
- directly using evidence
- using inferred model parameters (given w and V)
Iteration: known structures for every training point → optimal weights w (convex optimization); known weights w → good local structures (e.g. local search); optimize V so that I(E, V) mimics the good local structures well on the training data.
Future work: faster local search
Need efficient structure learning:
- amortize inference cost for scoring multiple search steps
- need support for nuisance variables in structure scores
[Figure: model with query variables and nuisance variables marked]
Iteration: known structures for every training point → optimal weights w (convex optimization); known weights w → good local structures (e.g. local search); optimize V so that I(E, V) mimics the good local structures well on the training data.
Recap
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments
Timeline (Summer 2009, Fall 2009, Spring 2010, Summer 2010)
- Validation of QS model simplification: activity recognition data, MLN data
- QS simplification: multi-path extensions for edge importance measures; computation trees connections; max-perturbation computation speedups
- QS learning: better exploration (stochastic optimization / multiple structures per datapoint); multi-query optimization; validation
- QS learning: nuisance variables support; local search speedups; quality guarantees; validation
- Write thesis, defend
Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
Speeding things up
Constraint-based algorithm:
1. set L = ∅
2. for every potential separator S ⊂ X s.t. |S| = k (there are O(|X|^k) separators here)
3.   do I(·) estimation, update L
4. find a junction tree (C,E) consistent with L
Observation: there are only |X|-k separators in (C,E), so the I(·) estimations for the remaining O(|X|^k) separators are wasted.
Faster heuristic:
1. until (C,E) passes checks
2.   do I(·) estimation, update L
3.   find a junction tree (C,E) consistent with L
Speeding things up
Recall that our upper bound on I(·) uses all Y ⊆ X\S for |Y| ≤ k.
Idea: get a rough estimate by only looking at smaller Y (e.g. |Y| = 2).
Faster heuristic:
1. estimate I(·) with |Y| = 2, form L
2. do
3.   find a junction tree (C,E) consistent with L
4.   estimate I(·|S) with |Y| = k for S ∈ 𝒮, update L
5.   check if (C,E) is still an ε-JT with the updated I(·|S)
6. until (C,E) passes checks
[Figure: I(V, X_{-VS} | S) = ?? bounded via I(Y ∩ V, Y ∩ X_{-VS} | S)]