
1 PGM: Tirgul 10. Learning Structure I

2 Benefits of Learning Structure
- Efficient learning: more accurate models with less data
  - Compare: P(A) and P(B) vs. joint P(A,B)
- Discover structural properties of the domain
  - Ordering of events
  - Relevance
- Identifying independencies → faster inference
- Predict effect of actions
  - Involves learning causal relationships among variables

3 Why Struggle for Accurate Structure?
[Figure: three network diagrams over the nodes Earthquake, Burglary, Alarm Set, and Sound]
Adding an arc:
- Increases the number of parameters to be fitted
- Wrong assumptions about causality and domain structure
Missing an arc:
- Cannot be compensated by accurate fitting of parameters
- Also misses causality and domain structure

4 Approaches to Learning Structure
- Constraint based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Pros & Cons
  - Pro: Intuitive, follows closely the construction of BNs
  - Pro: Separates structure learning from the form of the independence tests
  - Con: Sensitive to errors in individual tests

5 Approaches to Learning Structure
- Score based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score
- Pros & Cons
  - Pro: Statistically motivated
  - Pro: Can make compromises
  - Pro: Takes the structure of conditional probabilities into account
  - Con: Computationally hard

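6 Likelihood Score for Structures
First-cut approach:
- Use the likelihood function
- Recall, the likelihood score for a network structure and parameters is
  $$L(G, \theta_G : D) = \prod_m P(x_1[m], \ldots, x_n[m] : G, \theta_G) = \prod_m \prod_i P(x_i[m] \mid pa_i^G[m] : G, \theta_G)$$
- Since we know how to maximize the parameters, from now on we assume
  $$L(G : D) = \max_{\theta_G} L(G, \theta_G : D)$$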

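7 Likelihood Score for Structure (cont.)
Rearranging terms:
$$\ell(G : D) = \log L(G : D) = M \sum_i \left( I(X_i ; Pa_i^G) - H(X_i) \right)$$
where M is the number of examples and both quantities are computed from the empirical distribution of D:
- H(X) is the entropy of X
- I(X;Y) is the mutual information between X and Y
  - I(X;Y) measures how much “information” each variable provides about the other
  - I(X;Y) ≥ 0
  - I(X;Y) = 0 iff X and Y are independent
  - I(X;Y) = H(X) iff X is totally predictable given Y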
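The decomposition of the likelihood score into mutual information and entropy terms can be checked numerically. The following is a minimal sketch, assuming discrete data stored as a dict of equal-length lists; the function names (`entropy`, `mutual_information`, `max_log_likelihood`) and the data layout are hypothetical choices for illustration, not part of the original slides.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy H(X) in nats."""
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in nats."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def max_log_likelihood(data, parents):
    """Log-likelihood of a structure with maximum-likelihood parameters.

    data:    dict mapping variable name -> list of observed values
    parents: dict mapping variable name -> tuple of parent names
    """
    m = len(next(iter(data.values())))
    total = 0.0
    for var, pa in parents.items():
        # Joint parent configuration for every example (empty tuple if no parents).
        pa_vals = list(zip(*(data[p] for p in pa))) if pa else [()] * m
        joint = Counter(zip(data[var], pa_vals))
        pa_count = Counter(pa_vals)
        # The MLE of P(x | pa) is N(x, pa) / N(pa).
        total += sum(c * math.log(c / pa_count[u]) for (_, u), c in joint.items())
    return total

# Small illustrative dataset with structure X -> Y.
X, Y = list("HTTHTHH"), list("HTHHTTH")
data = {"X": X, "Y": Y}
direct = max_log_likelihood(data, {"X": (), "Y": ("X",)})
# X has no parents, so its mutual information term is zero.
decomposed = len(X) * (mutual_information(Y, X) - entropy(Y) - entropy(X))
```

Under these assumptions `direct` and `decomposed` agree up to floating-point error, which is the identity stated on the slide.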

8 Likelihood Score for Structure (cont.)
Good news:
- Intuitive explanation of likelihood score:
  - The larger the dependency of each variable on its parents, the higher the score
- Likelihood as a compromise among dependencies, based on their strength

9 Likelihood Score for Structure (cont.)
Bad news:
- Adding arcs always helps
  - I(X;Y) ≤ I(X;Y,Z)
  - Maximal score attained by fully connected networks
  - Such networks can overfit the data: parameters capture the noise in the data

10 Avoiding Overfitting
“Classic” issue in learning. Approaches:
- Restricting the hypothesis space
  - Limits the overfitting capability of the learner
  - Example: restrict # of parents or # of parameters
- Minimum description length
  - Description length measures complexity
  - Prefer models that compactly describe the training data
- Bayesian methods
  - Average over all possible parameter values
  - Use prior knowledge

11 Bayesian Inference
- Bayesian reasoning: compute expectation over unknown G
  $$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid G, D) \, P(G \mid D)$$
- Assumption: the Gs are mutually exclusive and exhaustive
- We know how to compute P(x[M+1]|G,D): same as prediction with fixed structure
- How do we compute P(G|D)?

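12 Posterior Score
Using Bayes rule:
$$P(G \mid D) = \frac{P(D \mid G) \, P(G)}{P(D)}$$
- P(G|D): posterior score
- P(D|G): marginal likelihood
- P(G): prior over structures
- P(D): probability of data; it is the same for all structures G, so it can be ignored when comparing structures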

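13 Marginal Likelihood
- By introduction of variables, we have that
  $$P(D \mid G) = \int P(D \mid G, \theta) \, P(\theta \mid G) \, d\theta$$
  where P(D|G,θ) is the likelihood and P(θ|G) is the prior over parameters
- This integral measures sensitivity to choice of parameters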

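14 Marginal Likelihood: Binomial case
Assume we observe a sequence of coin tosses, with a Dirichlet prior with hyperparameters α_H, α_T.
- By the chain rule we have:
  $$P(x[1], \ldots, x[M]) = \prod_{m=0}^{M-1} P(x[m+1] \mid x[1], \ldots, x[m])$$
- Recall that
  $$P(x[m+1] = H \mid x[1], \ldots, x[m]) = \frac{N_m^H + \alpha_H}{m + \alpha_H + \alpha_T}$$
  where N_m^H is the number of heads in the first m examples.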

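15 Marginal Likelihood: Binomials (cont.)
We simplify this by using
$$\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha)}$$
Thus
$$P(x[1], \ldots, x[M]) = \frac{\Gamma(\alpha_H + \alpha_T)}{\Gamma(\alpha_H + \alpha_T + N_H + N_T)} \cdot \frac{\Gamma(\alpha_H + N_H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + N_T)}{\Gamma(\alpha_T)}$$
where N_H and N_T are the numbers of heads and tails in the sequence and α_H, α_T are the Dirichlet hyperparameters.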
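The equivalence between the chain-rule product and the Gamma-function closed form can be sketched in a few lines. This is an illustrative check, assuming tosses are given as a string of "H" and "T" characters; the function names `seq_log_marginal` and `closed_log_marginal` are hypothetical.

```python
import math

def seq_log_marginal(tosses, a_h, a_t):
    """log P(D) via the chain rule: multiply the sequential predictive probabilities."""
    logp, n_h, n_t = 0.0, 0, 0
    for x in tosses:
        denom = n_h + n_t + a_h + a_t
        if x == "H":
            logp += math.log((n_h + a_h) / denom)
            n_h += 1
        else:
            logp += math.log((n_t + a_t) / denom)
            n_t += 1
    return logp

def closed_log_marginal(n_h, n_t, a_h, a_t):
    """log P(D) via the closed form with Gamma functions (computed with lgamma)."""
    lg = math.lgamma
    return (lg(a_h + a_t) - lg(a_h + a_t + n_h + n_t)
            + lg(a_h + n_h) - lg(a_h)
            + lg(a_t + n_t) - lg(a_t))

# Seven tosses with 4 heads and 3 tails under a Dirichlet(1,1) prior.
by_chain = seq_log_marginal("HTTHTHH", 1.0, 1.0)
by_gamma = closed_log_marginal(4, 3, 1.0, 1.0)
```

Working in log space with `math.lgamma` avoids overflow of the Gamma function for large counts. For this sequence with a Dirichlet(1,1) prior both routes give P(D) = 1/280.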

16 Binomial Likelihood: Example
- Idealized experiment with P(H) = 0.25
[Figure: (log P(D))/M plotted against M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors]

17 Marginal Likelihood: Example (cont.)
- Actual experiment with P(H) = 0.25
[Figure: (log P(D))/M plotted against M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors]

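18 Marginal Likelihood: Multinomials
The same argument generalizes to multinomials with a Dirichlet prior
- P(θ) is Dirichlet with hyperparameters α_1,…,α_K
- D is a dataset with sufficient statistics N_1,…,N_K
Then
$$P(D) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(\sum_k (\alpha_k + N_k)\right)} \prod_k \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}$$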

19 Marginal Likelihood: Bayesian Networks
- Network structure determines form of marginal likelihood

  m: 1 2 3 4 5 6 7
  X: H T T H T H H
  Y: H T H H T T H

Network 1 (X and Y with no arc between them): two Dirichlet marginal likelihoods
- P(X[1],…,X[7])
- P(Y[1],…,Y[7])

20 Marginal Likelihood: Bayesian Networks
- Network structure determines form of marginal likelihood

  m: 1 2 3 4 5 6 7
  X: H T T H T H H
  Y: H T H H T T H

Network 2 (X → Y): three Dirichlet marginal likelihoods
- P(X[1],…,X[7])
- P(Y[1],Y[4],Y[6],Y[7]) (the examples where X = H)
- P(Y[2],Y[3],Y[5]) (the examples where X = T)
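The two decompositions can be evaluated on the seven examples from the slides. The sketch below assumes a Dirichlet(1,1) prior for every family, which is a choice made here for illustration (the slides do not fix the hyperparameters for this example, and this choice is not a BDe prior); `log_dirichlet_marginal` is a hypothetical helper name.

```python
import math
from collections import Counter

def log_dirichlet_marginal(values, alpha):
    """log marginal likelihood of a multinomial sample under a Dirichlet prior.

    alpha: dict mapping each possible value -> its hyperparameter.
    """
    counts = Counter(values)
    a_sum = sum(alpha.values())
    out = math.lgamma(a_sum) - math.lgamma(a_sum + len(values))
    for v, a in alpha.items():
        out += math.lgamma(a + counts[v]) - math.lgamma(a)
    return out

X = "HTTHTHH"
Y = "HTHHTTH"
alpha = {"H": 1.0, "T": 1.0}  # assumed prior, same for every family

# Network 1: no arc, so X and Y each contribute one Dirichlet marginal likelihood.
net1 = log_dirichlet_marginal(X, alpha) + log_dirichlet_marginal(Y, alpha)

# Network 2: X -> Y, so the Y values are split by the value of X.
y_given_h = [y for x, y in zip(X, Y) if x == "H"]  # Y[1], Y[4], Y[6], Y[7]
y_given_t = [y for x, y in zip(X, Y) if x == "T"]  # Y[2], Y[3], Y[5]
net2 = (log_dirichlet_marginal(X, alpha)
        + log_dirichlet_marginal(y_given_h, alpha)
        + log_dirichlet_marginal(y_given_t, alpha))
```

With these assumed priors, `net1` is the log of 1/78400 and `net2` is the log of 1/67200. The comparison depends on the hyperparameters and should not be read as a general statement about either structure.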

21 Idealized Experiment
- P(X = H) = 0.5
- P(Y = H|X = H) = 0.5 + p
- P(Y = H|X = T) = 0.5 - p
[Figure: (log P(D))/M plotted against M (axis ticks at 1, 10, 100, 1000) with curves labeled Independent, P = 0.05, P = 0.10, P = 0.15, and P = 0.20]

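22 Marginal Likelihood for General Network
The marginal likelihood has the form:
$$P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + N(pa_i^G))} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + N(x_i, pa_i^G))}{\Gamma(\alpha(x_i, pa_i^G))}$$
where
- N(..) are the counts from the data
- α(..) are the hyperparameters for each family given G, with α(pa_i^G) the sum of α(x_i, pa_i^G) over the values x_i
Each term for a fixed i and pa_i^G is a Dirichlet marginal likelihood for the sequence of values of X_i when X_i's parents have a particular value.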

23 Priors
- We need: prior counts α(..) for each network structure G
- This can be a formidable task
  - There are exponentially many structures…

24 BDe Score
Possible solution: the BDe prior
- Represent the prior using two elements M_0, B_0
  - M_0: equivalent sample size
  - B_0: network representing the prior probability of events

25 BDe Score
Intuition: M_0 prior examples distributed by B_0
- Set α(x_i, pa_i^G) = M_0 P(x_i, pa_i^G | B_0)
  - Note that pa_i^G are not necessarily the same as the parents of X_i in B_0
  - Compute P(x_i, pa_i^G | B_0) using standard inference procedures
- Such priors have desirable theoretical properties
  - Equivalent networks are assigned the same score

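26 Bayesian Score: Asymptotic Behavior
Theorem: If the prior P(θ|G) is “well-behaved”, then
$$\log P(D \mid G) = \ell(G : D) - \frac{\log M}{2} \dim(G) + O(1)$$
where ℓ(G : D) is the maximized log-likelihood and dim(G) is the number of independent parameters in G.
Proof:
- For the case of Dirichlet priors, use Stirling’s approximation to Γ(·)
- General case: deferred to the incomplete data section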
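The leading terms of this approximation (the likelihood term minus the (log M)/2 · dim(G) penalty) can be computed directly from counts. The following is a minimal sketch for discrete variables, dropping the O(1) term; the function name `bic_score`, the data layout, and the explicit `cardinality` argument are assumptions made for illustration.

```python
import math
from collections import Counter

def bic_score(data, parents, cardinality):
    """Maximized log-likelihood minus (log M / 2) * dim(G) for a discrete network.

    data:        dict variable -> list of observed values
    parents:     dict variable -> tuple of parent names
    cardinality: dict variable -> number of values the variable can take
    """
    m = len(next(iter(data.values())))
    loglik, dim = 0.0, 0
    for var, pa in parents.items():
        pa_vals = list(zip(*(data[p] for p in pa))) if pa else [()] * m
        joint = Counter(zip(data[var], pa_vals))
        pa_count = Counter(pa_vals)
        loglik += sum(c * math.log(c / pa_count[u]) for (_, u), c in joint.items())
        # Independent parameters: (|X| - 1) per joint configuration of the parents.
        dim += (cardinality[var] - 1) * math.prod(cardinality[p] for p in pa)
    return loglik - 0.5 * math.log(m) * dim

data = {"X": list("HTTHTHH"), "Y": list("HTHHTTH")}
card = {"X": 2, "Y": 2}
independent = bic_score(data, {"X": (), "Y": ()}, card)
dependent = bic_score(data, {"X": (), "Y": ("X",)}, card)
```

On this seven-example dataset the extra parameter of the X → Y structure costs more than the likelihood gain, so `independent` comes out higher than `dependent`. With so few examples the dropped O(1) term is not negligible, so this is only an illustration of the tradeoff.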

27 Asymptotic Behavior: Consequences
- Bayesian score is consistent
  - As M → ∞, the “true” structure G* maximizes the score (almost surely)
  - For sufficiently large M, the maximal scoring structures are equivalent to G*
- Observed data eventually overrides prior information
  - Assuming that the prior assigns positive probability to all cases

28 Asymptotic Behavior
$$\log P(D \mid G) = \ell(G : D) - \frac{\log M}{2} \dim(G) + O(1)$$
- This score can also be justified by the Minimal Description Length (MDL) principle
- This equation explicitly shows the tradeoff between
  - Fitness to data: the likelihood term ℓ(G : D)
  - Penalty for complexity: the regularization term (log M)/2 · dim(G)

29 Scores -- Summary
- Likelihood, MDL, and (log) BDe have the form
  $$\mathrm{Score}(G : D) = \sum_i \mathrm{Score}(X_i \mid Pa_i^G : D)$$
- BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience
- BDe is consistent and asymptotically equivalent (up to a constant) to MDL
- All are score-equivalent
  - G equivalent to G’ ⇒ Score(G) = Score(G’)

30 Optimization Problem
Input:
- Training data
- Scoring function (including priors, if needed)
- Set of possible structures
  - Including prior knowledge about structure
Output:
- A network (or networks) that maximize the score
Key property:
- Decomposability: the score of a network is a sum of terms.

31 Learning Trees
- Trees:
  - At most one parent per variable
- Why trees?
  - Elegant math → we can solve the optimization problem efficiently (with a greedy algorithm)
  - Sparse parameterization → avoid overfitting while adapting to the data

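32 Learning Trees (cont.)
- Let p(i) denote the parent of X_i, or 0 if X_i has no parents
- We can write the score as
  $$\mathrm{Score}(G : D) = \sum_{i : p(i) > 0} \left( \mathrm{Score}(X_i \mid X_{p(i)}) - \mathrm{Score}(X_i) \right) + \sum_i \mathrm{Score}(X_i)$$
  - The first sum is the improvement over the “empty” network
  - The second sum is the score of the “empty” network
- Score = sum of edge scores + constant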

33 Learning Trees (cont.)
Algorithm:
- Construct a graph with vertices 1, 2, …
  - Set w(i → j) to be Score(X_j | X_i) - Score(X_j)
- Find a tree (or forest) with maximal weight
  - This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm)
Theorem: This procedure finds the tree with maximal score
When the score is likelihood, w(i → j) is proportional to I(X_i ; X_j); this is known as the Chow & Liu method
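For the likelihood score, the procedure above reduces to a maximum spanning tree over pairwise mutual information. The following is a minimal sketch using Kruskal's algorithm with a union-find structure; the function names and the toy data are hypothetical, and the result is returned as undirected edges (a rooted tree can be obtained by choosing any root and directing edges away from it).

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def chow_liu_edges(data):
    """Maximum-weight spanning forest over variables, weighted by mutual information."""
    names = list(data)
    edges = sorted(((mutual_information(data[a], data[b]), a, b)
                    for a, b in combinations(names, 2)), reverse=True)
    root = {v: v for v in names}  # union-find forest

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]  # path halving
            v = root[v]
        return v

    tree = []
    for w, a, b in edges:  # heaviest edges first (Kruskal)
        ra, rb = find(a), find(b)
        # Skip edges that would close a cycle or that carry no information.
        if ra != rb and w > 0:
            root[ra] = rb
            tree.append((a, b))
    return tree

# Toy data: B is a noisy copy of A, C is a noisy copy of B, A and C are independent.
toy = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 1, 1, 1, 1, 0],
    "C": [0, 0, 1, 1, 1, 1, 0, 0],
}
tree = chow_liu_edges(toy)
```

On the toy data the recovered skeleton is the chain A, B, C. Dropping zero-weight edges yields a forest rather than a tree when some variables are empirically independent of the rest, matching the "tree (or forest)" wording above.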


35 Beyond Trees
When we consider more complex networks, the problem is not as easy
- Suppose we allow two parents
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm is known
Theorem: Finding the maximal scoring network structure with at most k parents for each variable is NP-hard for k > 1

36 Heuristic Search
We address the problem by using heuristic search
- Define a search space:
  - nodes are possible structures
  - edges denote adjacency of structures
- Traverse this space looking for high-scoring structures
Search techniques:
- Greedy hill-climbing
- Best first search
- Simulated annealing
- ...

37 Heuristic Search (cont.)
- Typical operations:
  - Add C → D
  - Reverse C → E
  - Delete C → E
[Figure: a network over the nodes S, C, E, D and the networks obtained from it by each operation]

38 Exploiting Decomposability in Local Search
- Caching: to update the score of a network after a local change, we only need to re-score the families that were changed in the last move
[Figure: four networks over the nodes S, C, E, D]

39 Greedy Hill-Climbing
- Simplest heuristic local search
  - Start with a given network
    - empty network
    - best tree
    - a random network
  - At each iteration
    - Evaluate all possible changes
    - Apply the change that leads to the best improvement in score
    - Reiterate
  - Stop when no modification improves the score
- Each step requires evaluating approximately n new changes
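The loop described above can be sketched with the three edge operations (add, delete, reverse) and a cache of family scores. This is an illustrative sketch only: it starts from the empty network, uses a BIC-style family score as a stand-in for whichever decomposable score is chosen, and all function names are hypothetical. For simplicity it re-enumerates every neighbor at each step and relies on the cache to avoid re-scoring unchanged families.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, var, pa):
    """BIC-style family score: max log-likelihood minus (log M / 2) * #parameters."""
    m = len(data[var])
    pa_vals = list(zip(*(data[p] for p in pa))) if pa else [()] * m
    joint, pa_count = Counter(zip(data[var], pa_vals)), Counter(pa_vals)
    loglik = sum(c * math.log(c / pa_count[u]) for (_, u), c in joint.items())
    card = lambda v: len(set(data[v]))  # cardinality estimated from the data
    dim = (card(var) - 1) * math.prod(card(p) for p in pa)
    return loglik - 0.5 * math.log(m) * dim

def is_acyclic(parents):
    """Depth-first search over parent links; a revisit on the current path is a cycle."""
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """All structures one edge operation away that remain acyclic."""
    for a, b in permutations(parents, 2):
        new = {v: set(ps) for v, ps in parents.items()}
        if a in parents[b]:
            new[b].discard(a)                      # delete a -> b
            yield new
            rev = {v: set(ps) for v, ps in new.items()}
            rev[a].add(b)                          # reverse a -> b
            if is_acyclic(rev):
                yield rev
        else:
            new[b].add(a)                          # add a -> b
            if is_acyclic(new):
                yield new

def hill_climb(data):
    parents = {v: set() for v in data}             # start from the empty network
    cache = {}                                     # family scores, reused across moves
    def score(g):
        total = 0.0
        for v, ps in g.items():
            key = (v, frozenset(ps))
            if key not in cache:
                cache[key] = family_score(data, v, sorted(ps))
            total += cache[key]
        return total
    best = score(parents)
    while True:
        cand = max(neighbors(parents), key=score, default=None)
        if cand is None or score(cand) <= best + 1e-12:
            return parents, best                   # no change improves the score
        parents, best = cand, score(cand)

# Toy data: B copies A, C is independent of both.
toy = {"A": [0, 1] * 20, "B": [0, 1] * 20, "C": [0, 0, 1, 1] * 10}
graph, final_score = hill_climb(toy)
```

On the toy data the search adds a single arc between A and B and then stops, since reversing that arc leaves the score unchanged and every other change lowers it.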

40 Greedy Hill-Climbing: Possible Pitfalls
- Greedy hill-climbing can get stuck in:
  - Local maxima:
    - All one-edge changes reduce the score
  - Plateaus:
    - Some one-edge changes leave the score unchanged
    - Happens because equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both
  - Random restarts
  - TABU search

41 Equivalence Class Search
Idea:
- Search the space of equivalence classes
- Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Benefits:
- The space of PDAGs has fewer local maxima and plateaus
- There are fewer PDAGs than DAGs

42 Equivalence Class Search (cont.)
- Evaluating changes is more expensive
- These algorithms are more complex to implement
[Figure: over the nodes X, Y, Z: original PDAG, the operation "Add Y---Z", the new PDAG, a consistent DAG, and its score]

43 Learning in Practice: Alarm domain
[Figure: KL divergence (0 to 2) plotted against M (0 to 5000) for two curves: True Structure/BDe M' = 10 and Unknown Structure/BDe M' = 10]

44 Model Selection
- So far, we focused on a single model
  - Find the best scoring model
  - Use it to predict the next example
- Implicit assumption:
  - Best scoring model dominates the weighted sum
- Pros:
  - We get a single structure
  - Allows for efficient use in our tasks
- Cons:
  - We are committing to the independencies of a particular structure
  - Other structures might be as probable given the data

45 Model Averaging
- Recall, Bayesian analysis started with
  $$P(x[M+1] \mid D) = \sum_G P(x[M+1] \mid G, D) \, P(G \mid D)$$
  - This requires us to average over all possible models

46 Model Averaging (cont.)
- Full averaging
  - Sum over all structures
  - Usually intractable: there are exponentially many structures
- Approximate averaging
  - Find the K largest scoring structures
  - Approximate the sum by averaging over their predictions
  - Weight of each structure determined by the Bayes factor
    $$\frac{P(G \mid D)}{P(G' \mid D)} = \frac{P(G) \, P(D \mid G)}{P(G') \, P(D \mid G')}$$
    where P(G) P(D|G) is the actual score we compute
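The approximate averaging step can be sketched as a normalization of the K retained scores followed by a weighted sum of predictions. This is a minimal illustration, assuming each structure's score is available as log P(G) + log P(D|G) and each structure supplies a scalar predictive probability; the function names and the example numbers are hypothetical.

```python
import math

def posterior_weights(log_scores):
    """Normalize log P(G) + log P(D|G) over the K retained structures.

    Subtracting the maximum before exponentiating avoids underflow,
    since log scores are typically large negative numbers.
    """
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def averaged_prediction(log_scores, predictions):
    """Weighted average of per-structure predictions P(x[M+1] | G, D)."""
    weights = posterior_weights(log_scores)
    return sum(w * p for w, p in zip(weights, predictions))

# Two hypothetical structures with marginal likelihoods 1/67200 and 1/78400
# and a uniform structure prior (which cancels in the normalization).
scores = [math.log(1 / 67200), math.log(1 / 78400)]
weights = posterior_weights(scores)
prediction = averaged_prediction(scores, [0.75, 0.5])
```

The ratio of any two weights equals the Bayes factor between the corresponding structures. Restricting the sum to K structures means the weights are normalized over that set only, not over all structures.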

47 Search: Summary
- Discrete optimization problem
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~10 min), thanks to:
    - Decomposability
    - Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
  - Example: learning trees
