Learning the Structure of Markov Logic Networks
Stanley Kok


Overview
- Introduction
- CLAUDIEN, CRFs
- Algorithm
  - Evaluation Measure
  - Clause Construction
  - Search Strategies
  - Speedup Techniques
- Experiments

Introduction
- Richardson & Domingos (2004) learned MLN structure in two disjoint steps:
  - Learn FO clauses with an off-the-shelf ILP system (CLAUDIEN)
  - Learn clause weights by optimizing pseudo-likelihood
- We develop an algorithm that:
  - Learns FO clauses by directly optimizing pseudo-likelihood
  - Is fast enough to be practical
  - Learns better structure than R&D, pure ILP, purely probabilistic and purely KB approaches

CLAUDIEN
- CLAUsal DIscovery ENgine
- Starts with the trivially false clause
- Repeatedly refines current clauses by adding literals
- Adds clauses that satisfy minimum accuracy and coverage to the KB
- [Figure: refinement lattice over literals m, f, h]
  - true ⇒ false
  - m ⇒ false, f ⇒ false, h ⇒ false
  - m ∧ f ⇒ false, m ⇒ h, m ⇒ f, m ∧ h ⇒ false, f ⇒ h, f ⇒ m, f ∧ h ⇒ false, h ⇒ f, h ⇒ m
  - h ⇒ m ∨ f

CLAUDIEN
- Language bias ≡ clause template
  - Can refine a handcrafted KB
- Example: Professor(P) ⇐ AdvisedBy(S,P) in the KB
  - dlab_template('1-2:[Professor(P),Student(S)]<- AdvisedBy(S,P)')
  - Professor(P) ∨ Student(S) ⇐ AdvisedBy(S,P)

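Conditional Random Fields
- Markov networks used to compute P(y|x) (McCallum 2003)
- Model (linear chain): $P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$
- Features f_k, e.g. "current word is capitalized and next word is Inc"
- [Figure: linear-chain CRF with labels y_1, y_2, y_3, ..., y_{n-1}, y_n over the input x_1, x_2, ..., x_n; example input "IBM hired Alice ..." with labels Org, Person, Misc]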

CRF – Feature Induction
- Set of atomic features (word=the, capitalized, etc.)
- Starts from empty CRF
- While convergence criterion is not met
  - Create list of new features consisting of
    - Atomic features
    - Binary conjunctions of atomic features
    - Conjunctions of atomic features with features already in model
  - Evaluate gain in P(y|x) of adding each feature to model
  - Add best K features to model (100s-1000s features)

Algorithm
- High-level algorithm
    Repeat
      Clauses <- FindBestClauses(MLN)
      Add Clauses to MLN
    Until Clauses = ∅
- FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain = evaluation measure of adding c to MLN
    Return k clauses with highest gain
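
The two procedures above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the `Clause` type, the fixed `candidates` pool and the `gain` callback are hypothetical stand-ins for candidate generation and for the evaluation measure.

```python
from typing import Callable, Iterable, List, Set

Clause = str  # hypothetical: a clause is just an opaque label in this sketch


def find_best_clauses(mln: Set[Clause],
                      candidates: Iterable[Clause],
                      gain: Callable[[Set[Clause], Clause], float],
                      k: int) -> List[Clause]:
    """Return up to k candidate clauses with the highest strictly positive gain."""
    scored = [(gain(mln, c), c) for c in candidates if c not in mln]
    scored = [(g, c) for g, c in scored if g > 0]
    scored.sort(key=lambda gc: gc[0], reverse=True)
    return [c for _, c in scored[:k]]


def learn_structure(initial: Iterable[Clause],
                    candidates: Iterable[Clause],
                    gain: Callable[[Set[Clause], Clause], float],
                    k: int = 1) -> Set[Clause]:
    """Repeat: add the best clauses to the MLN, until none are found."""
    mln = set(initial)
    pool = list(candidates)
    while True:
        best = find_best_clauses(mln, pool, gain, k)
        if not best:  # Clauses = empty set: stop
            return mln
        mln.update(best)
```

In a real learner the gain would come from re-optimizing the weights with the candidate clause added; here any callable that scores a clause against the current MLN will do.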

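Evaluation Measure
- Ideally use log-likelihood, but it is slow
  - Recall: $P_w(X = x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right)$, where n_i(x) is the number of true groundings of clause i in x
  - Value: $\log P_w(X = x) = \sum_i w_i n_i(x) - \log Z$
  - Gradient: $\frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - E_w[n_i(x)]$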

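Evaluation Measure
- Use pseudo-log-likelihood (R&D 2004), but it gives undue weight to predicates with a large # of groundings
  - Recall: $\log P^{*}_w(X = x) = \sum_{l=1}^{n} \log P_w(X_l = x_l \mid MB_x(X_l))$
  - E.g.: a binary predicate such as AdvisedBy(S,P) has many more groundings than a unary predicate such as Professor(P)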

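Evaluation Measure
- Use weighted pseudo-log-likelihood (WPLL): $\mathrm{WPLL}_w(x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k}))$
  - R is the set of FO predicates, g_r is the number of groundings of predicate r, and c_r is the weight of predicate r
  - E.g.: c_r = 1/g_r weights all FO predicates equally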
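
To make the difference between the unweighted and weighted measures concrete, here is a small sketch. It assumes the conditional probabilities of each ground predicate given its Markov blanket are already available, and it fixes the predicate weight to c_r = 1/g_r; the function names and the input layout are illustrative only.

```python
import math
from typing import Dict, List


def pll(cond_probs: Dict[str, List[float]]) -> float:
    """Unweighted pseudo-log-likelihood: sum of log P over all ground predicates."""
    return sum(math.log(p) for probs in cond_probs.values() for p in probs)


def wpll(cond_probs: Dict[str, List[float]]) -> float:
    """Weighted pseudo-log-likelihood with c_r = 1 / g_r (assumed choice)."""
    total = 0.0
    for probs in cond_probs.values():
        if not probs:  # a predicate with no groundings contributes nothing
            continue
        c_r = 1.0 / len(probs)
        total += c_r * sum(math.log(p) for p in probs)
    return total


# Hypothetical numbers: 2 groundings of Professor, 100 groundings of AdvisedBy.
example = {"Professor": [0.5, 0.5], "AdvisedBy": [0.9] * 100}
print(pll(example), wpll(example))
```

In the unweighted sum the 100 AdvisedBy terms dominate, whereas with c_r = 1/g_r each FO predicate contributes the average CLL of its groundings.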


Clause Construction
- Add a literal (negative/positive)
  - All possible ways variables of the new literal can be shared with those of the clause
  - E.g. !Student(S) v AdvBy(S,P)
- Remove a literal (when refining an MLN)
  - Removes spurious conditions from rules
  - E.g. !Student(S) v !YrInPgm(S,5) v TA(S,C) v TmpAdvBy(S,P)
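
A rough sketch of the "add a literal" operator follows. It makes several assumptions that the slide does not state: clauses are lists of literal strings, argument types are ignored, the new literal must share at least one variable with the clause, and fresh variables get the hypothetical names `_N0`, `_N1`, ... (so clause variables must not use that prefix).

```python
from itertools import product
from typing import List, Tuple


def argument_bindings(clause_vars: List[str], arity: int) -> List[Tuple[str, ...]]:
    """Enumerate argument tuples built from clause variables and fresh variables."""
    fresh = [f"_N{i}" for i in range(arity)]
    bindings = []
    for args in product(clause_vars + fresh, repeat=arity):
        # Assumption: at least one variable is shared with the clause.
        if not any(a in clause_vars for a in args):
            continue
        # Skip renamings: fresh variables must first appear in the order _N0, _N1, ...
        first_use = list(dict.fromkeys(a for a in args if a in fresh))
        if first_use != fresh[:len(first_use)]:
            continue
        bindings.append(args)
    return bindings


def add_literal(clause: List[str], pred: str, arity: int,
                clause_vars: List[str]) -> List[List[str]]:
    """Return every extension of the clause by a positive or negated literal of pred."""
    extended = []
    for args in argument_bindings(clause_vars, arity):
        for sign in ("", "!"):
            extended.append(clause + [f"{sign}{pred}({','.join(args)})"])
    return extended


for new_clause in add_literal(["!Student(S)"], "AdvBy", 2, ["S"]):
    print(" v ".join(new_clause))
```

Under these assumptions, extending !Student(S) with the binary predicate AdvBy yields six clauses, one of which corresponds to the slide's example with the fresh variable in place of P.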

Clause Construction
- Flip signs of literals (when refining an MLN)
  - Moves literals that are on the wrong side of the implication
  - E.g. !CseQtr(C1,Q1) v !CseQtr(C2,Q2) v !SameCse(C1,C2) v !SameQtr(Q1,Q2)
  - Done at the beginning of the algorithm
  - Expensive, optional
- Limit # of distinct variables to restrict the search space


Search Strategies
- Shortest-first search (SFS), first pass over the candidate set:
  1. Find gain of each clause
  2. Sort clauses by gain
  3. Return top 5 with positive gain
  4. Add 5 clauses to MLN
  5. Retrain wts of MLN
- Next pass over the candidate set:
  1. Find gain of each clause
  2. Sort them by gain (Yikes! All length-2 clauses have gains ≤ 0)
- [Figure: MLN holding weighted clauses (wt1, !AdvBy(S,P); wt2, clause2; ...) next to a candidate set containing e.g. !AdvBy(S,P) v Stu(S)]

Shortest-First Search
  a. Extend the 20 length-2 clauses with highest gains
  b. Form new candidate set
  c. Keep 1000 clauses with highest gains
- [Figure: MLN (wt1, !AdvBy(S,P); wt2, clause2; ...) and candidate set, in which !AdvBy(S,P) v Stu(S) is extended to !AdvBy(S,P) v Stu(S) v Prof(P)]

Shortest-First Search
- Shortest-first search (SFS)
  - Repeat process
  - Extend all length-2 clauses before length-3 ones
- How do you refine a non-empty MLN?
- [Figure: MLN (wt1, clause1; wt2, clause2; ...) and candidate set]

SFS – MLN Refinement
  a. Extend the 20 length-2 clauses with highest gains
  b. Extend length-2 clauses in MLN
  c. Remove a predicate from length-4 clauses in MLN
  d. Flip signs of length-3 clauses in MLN (optional)
  e. Clauses produced by b, c, d replace the original clause in the MLN
- [Figure: MLN (wt1, !AdvBy(S,P); wt2, clause2; ...; wtA, clauseA; wtB, clauseB; ...)]

Search Strategies
- Beam Search
  1. Keep a beam of 5 clauses with highest gains
  2. Track best clause
  3. Stop when best clause does not change after two consecutive iterations
- How do you refine a non-empty MLN?
- [Figure: MLN (wt1, clause1; wt2, clause2; ...; wtA, clauseA; wtB, clauseB; ...)]
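
The three beam-search steps can be sketched generically as below. The `refine` and `score` callbacks are hypothetical placeholders for clause construction and for the gain of a clause; the defaults mirror the beam width of 5 and the two-iteration stopping rule from the slide.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def beam_search(start: Iterable[Hashable],
                refine: Callable[[Hashable], List[Hashable]],
                score: Callable[[Hashable], float],
                beam_width: int = 5,
                patience: int = 2) -> Tuple[Optional[Hashable], float]:
    """Keep the beam_width best refinements; stop once the best clause has not
    improved for `patience` consecutive iterations or nothing is left to refine."""
    beam = list(start)
    best, best_score = None, float("-inf")
    unchanged = 0
    while beam and unchanged < patience:
        candidates = [c2 for c in beam for c2 in refine(c)]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        if beam and score(beam[0]) > best_score:
            best, best_score = beam[0], score(beam[0])
            unchanged = 0
        else:
            unchanged += 1
    return best, best_score
```

For example, with clauses modelled as tuples of integers, `refine` appending one of 1, 2, 3 and `score` penalizing the distance of the sum from 5, the search returns a tuple whose elements sum to 5.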


Difference from CRF – Feature Induction
- CRF feature induction:
  - Set of atomic features (word=the, capitalized, etc.)
  - Start from empty CRF
  - While convergence criterion is not met
    - Create list of new features consisting of
      - Atomic features
      - Binary conjunctions of atomic features
      - Conjunctions of atomic features with features already in model
    - Evaluate gain in P(y|x) of adding each feature to model
    - Add best K features to model (100s-1000s features)
- Our algorithm:
  - We can refine a non-empty MLN
  - We use pseudo-likelihood; different optimizations
  - Applicable to arbitrary MNs (not only linear chains)
  - Maintains a separate candidate set
  - Adds best ≈ 10s of clauses to the model
  - Flexible enough to fit in different search algorithms


Speedup Techniques
- Recall: FindBestClauses(MLN)
    Search for, and create, candidate clauses
    For each candidate clause c
      Compute gain (in WPLL) of adding c to MLN
    Return k clauses with highest gain
- LearnWeights(MLN+c) to optimize WPLL with L-BFGS
  - L-BFGS computes value and gradient of WPLL
- Many candidate clauses; important to compute WPLL and its gradient efficiently

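Speedup Techniques
- WPLL: $\mathrm{WPLL}_w(x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k}))$
- CLL of a ground predicate X_l: $\log P_w(X_l = x_l \mid MB_x(X_l)) = \sum_i w_i n_i(x) - \log\left[ \exp\left( \sum_i w_i n_i(x_{[X_l=0]}) \right) + \exp\left( \sum_i w_i n_i(x_{[X_l=1]}) \right) \right]$
- Ignore clauses in which the predicate does not appear: their counts are identical in all three sums and cancel
  - E.g. predicate l does not appear in clause 1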
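
The cancellation argument can be checked numerically. The sketch below computes the CLL of one ground predicate from per-clause true-grounding counts in the current world and in the two worlds where the predicate is forced false or true; the function name and argument layout are assumptions made for illustration.

```python
import math
from typing import Sequence


def cll(weights: Sequence[float],
        n_current: Sequence[float],
        n_if_false: Sequence[float],
        n_if_true: Sequence[float]) -> float:
    """log P(X_l = x_l | MB) from per-clause counts of true groundings."""
    s_cur = sum(w * n for w, n in zip(weights, n_current))
    s0 = sum(w * n for w, n in zip(weights, n_if_false))
    s1 = sum(w * n for w, n in zip(weights, n_if_true))
    # log(exp(s0) + exp(s1)), computed stably
    m = max(s0, s1)
    log_norm = m + math.log(math.exp(s0 - m) + math.exp(s1 - m))
    return s_cur - log_norm


# One clause contains the predicate; its count changes when the predicate flips.
only_relevant = cll([1.5], [3], [2], [3])
# Add a clause that does not contain the predicate: same count (10) in all worlds.
with_irrelevant = cll([1.5, 0.7], [3, 10], [2, 10], [3, 10])
print(only_relevant, with_irrelevant)
```

Both calls give the same value, because the second clause adds the same term to the numerator and to both terms of the normalizer.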

Speedup Techniques
- A ground predicate's CLL is affected only by the clauses that contain it
- Most clause weights do not change significantly
  - Most CLLs do not change much
  - Don't have to recompute all CLLs
- Store WPLL and CLLs
  - Recompute a CLL only if the weights affecting it change beyond some threshold
  - Subtract old CLLs and add new CLLs to WPLL
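
One possible way to organize this caching is sketched below. The class name, the `compute_cll` callback and the per-atom weight snapshots are hypothetical design choices; the slide only specifies the idea of recomputing a CLL when the weights affecting it have moved beyond a threshold.

```python
from typing import Callable, Dict, Hashable, List, Sequence


class CachedWPLL:
    """Stores the WPLL and per-atom CLLs; recomputes a CLL only when a weight
    of a clause containing that atom changed by more than `threshold`."""

    def __init__(self, atoms: Sequence[Hashable],
                 clauses_of: Dict[Hashable, List[int]],
                 coeff: Dict[Hashable, float],
                 compute_cll: Callable[[Hashable, Sequence[float]], float],
                 weights: Sequence[float],
                 threshold: float = 1e-3) -> None:
        self.clauses_of, self.coeff = clauses_of, coeff
        self.compute_cll, self.threshold = compute_cll, threshold
        self.recomputed = 0  # number of CLLs recomputed by update()
        self.used_w = {a: list(weights) for a in atoms}
        self.cll = {a: compute_cll(a, weights) for a in atoms}
        self.wpll = sum(coeff[a] * self.cll[a] for a in atoms)

    def update(self, weights: Sequence[float]) -> float:
        for a, old_w in self.used_w.items():
            if any(abs(weights[i] - old_w[i]) > self.threshold
                   for i in self.clauses_of[a]):
                new = self.compute_cll(a, weights)
                # Subtract the old CLL and add the new one.
                self.wpll += self.coeff[a] * (new - self.cll[a])
                self.cll[a] = new
                self.used_w[a] = list(weights)
                self.recomputed += 1
        return self.wpll
```

Comparing against the weights used at the last recomputation, rather than the previous step, keeps many small drifts from accumulating unnoticed; that detail is an assumption of this sketch.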

Speedup Techniques
- WPLL is a sum over all ground predicates
- Estimate WPLL
  - Uniformly sample groundings of each FO predicate
  - Sample x% of the # of groundings, subject to a min and a max
  - Extrapolate the average
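
A minimal sketch of the sampling estimate, assuming c_r = 1/g_r so that each predicate's WPLL term is the average CLL of its groundings. The parameter names (`fraction`, `min_n`, `max_n`) and the `cll` callback are illustrative, not taken from the original system.

```python
import random
from typing import Callable, Dict, Hashable, List, Optional


def estimate_wpll(groundings: Dict[str, List[Hashable]],
                  cll: Callable[[Hashable], float],
                  fraction: float = 0.1,
                  min_n: int = 1,
                  max_n: int = 1000,
                  rng: Optional[random.Random] = None) -> float:
    """Estimate the WPLL by averaging the CLL over a uniform sample of the
    groundings of each first-order predicate (assumes c_r = 1 / g_r)."""
    rng = rng or random.Random(0)
    total = 0.0
    for atoms in groundings.values():
        if not atoms:
            continue
        # Sample a fraction of the groundings, clamped to [min_n, max_n].
        n = max(min_n, min(max_n, round(fraction * len(atoms))))
        n = min(n, len(atoms))
        sample = rng.sample(atoms, n)
        total += sum(cll(a) for a in sample) / n
    return total
```

With `fraction=1.0` and a large enough `max_n` the estimate coincides with the exact value, which gives a simple sanity check.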

Speedup Techniques
- WPLL and its gradient
  - Need to compute the # of true groundings of a clause
  - #P-complete problem
- Karp & Luby (1983)'s Monte-Carlo algorithm
  - Gives an estimate that is within ε of the true value with probability 1 − δ
  - Draws samples of groundings of a clause
- Found that the estimate converges faster than the algorithm specifies
  - Use convergence test (DeGroot & Schervish 2002) after every 100 samples
  - Earlier termination
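
The idea of sampling with an early-stopping check every 100 samples can be illustrated as follows. This is not the Karp & Luby algorithm and not the specific DeGroot & Schervish test: it swaps in plain uniform sampling of groundings and a simple standard-error threshold, and `is_true` is a hypothetical callback that evaluates one grounding.

```python
import math
import random
from typing import Callable, Optional, Tuple


def estimate_true_groundings(num_groundings: int,
                             is_true: Callable[[int], bool],
                             tol: float = 0.01,
                             batch: int = 100,
                             max_samples: int = 100_000,
                             rng: Optional[random.Random] = None) -> Tuple[float, int]:
    """Estimate how many groundings are true; after every `batch` samples,
    stop if the standard error of the estimated fraction is below `tol`."""
    rng = rng or random.Random(0)
    hits = 0
    n = 0
    p = 0.0
    while n < max_samples:
        for _ in range(batch):
            hits += bool(is_true(rng.randrange(num_groundings)))
        n += batch
        p = hits / n
        stderr = math.sqrt(p * (1.0 - p) / n)
        if stderr < tol:  # simplified convergence check
            break
    return p * num_groundings, n
```

Note that this simplified check stops after the first batch when every sample agrees, so it is only suitable as an illustration of early termination.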

Speedup Techniques
- L-BFGS used to learn clause weights to optimize WPLL
- Two parameters:
  - Max number of iterations
  - Convergence threshold
- Use a smaller max # of iterations and looser convergence thresholds
  - When evaluating a candidate clause's gain
  - Faster termination

Speedup Techniques
- Lexicographic ordering on clauses
  - Avoids redundant computations for clauses that are syntactically the same
  - Does not detect semantically identical but syntactically different clauses (NP-complete problem)
- Cache new clauses
  - Avoids recomputation
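
One way to obtain a cache key for syntactically identical clauses is to normalize literal order and variable names, as in the sketch below. The clause representation (a list of `(negated, predicate, args)` triples) is an assumption, and the normalization is a heuristic: equal keys always mean the same clause up to literal order and variable renaming, but some equal clauses may still get different keys when a predicate occurs more than once.

```python
from typing import Dict, List, Tuple

Literal = Tuple[bool, str, Tuple[str, ...]]  # (negated, predicate, arguments)


def canonical_key(clause: List[Literal]) -> Tuple:
    """Sort literals by predicate, sign and arity, rename variables in order of
    first appearance, then sort the renamed literals lexicographically."""
    ordered = sorted(clause, key=lambda lit: (lit[1], lit[0], len(lit[2])))
    rename: Dict[str, str] = {}
    renamed = []
    for negated, pred, args in ordered:
        new_args = tuple(rename.setdefault(a, f"v{len(rename)}") for a in args)
        renamed.append((pred, negated, new_args))
    return tuple(sorted(renamed))


gain_cache: Dict[Tuple, float] = {}
c1 = [(True, "Student", ("S",)), (False, "AdvBy", ("S", "P"))]
c2 = [(False, "AdvBy", ("X", "Y")), (True, "Student", ("X",))]
gain_cache[canonical_key(c1)] = 0.42  # hypothetical gain value
print(canonical_key(c2) in gain_cache)
```

Here c1 and c2 differ only in literal order and variable names, so the second lookup hits the cached entry.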

Speedup Techniques
- Also used R&D04 techniques for the WPLL gradient:
  - Ignore predicates that don't appear in the i-th formula
  - Ignore ground formulas whose truth value is unaffected by changing the truth value of any literal
  - # of true groundings of a clause computed once and cached


Experiments
- UW-CSE domain
  - 22 predicates, e.g. AdvisedBy, Professor, etc.
  - 10 types, e.g. Person, Course, Quarter, etc.
  - Total # of ground predicates: about 4 million
  - # of true ground predicates (in DB) = 3212
  - Handcrafted KB with 94 formulas
    - Each student has at most one advisor
    - If a student is an author of a paper, so is her advisor
    - etc.

Experiments
- Cora domain
  - 1295 citations to 112 CS research papers
  - Author, Venue, Title, Year fields
  - 5 predicates, viz. SameCitation, SameAuthor, SameVenue, SameTitle, SameYear
  - Evidence predicates, e.g. WordsInCommonInTitle20%(title1, title2)
  - Total # of ground predicates: about 5 million
  - # of true ground predicates (in DB) = 378,589
  - Handcrafted KB with 26 clauses
    - If two citations are the same, then they have the same authors, titles, etc., and vice versa
    - If two titles have many words in common, then they are the same, etc.

Systems
- MLN(KB): weight-learning applied to handcrafted KB
- MLN(CL): structure-learning with CLAUDIEN; weight-learning
- MLN(KB+CL): structure-learning with CLAUDIEN, using the handcrafted KB as its language bias; weight-learning
- MLN(SLB): structure-learning with beam search, start from empty MLN
- MLN(KB+SLB): ditto, start from handcrafted KB
- MLN(SLB+KB): structure-learning with beam search, start from empty MLN, allow handcrafted clauses to be added in a first search step
- MLN(SLS): structure-learning with SFS, start from empty MLN

Systems
- CL: CLAUDIEN alone
- KB: handcrafted KB alone
- KB+CL: CLAUDIEN with KB as its language bias
- NB: naïve Bayes
- BN: Bayesian networks

Methodology
- UW-CSE domain
  - DB divided into 5 areas: ai, graphics, languages, systems, theory
  - Leave-one-out testing by area
- Cora domain
  - 5 different train-test splits
- Measured
  - average CLL of the predicates
  - average area under the precision-recall curve of the predicates (AUC)
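
The leave-one-out protocol for the UW-CSE areas amounts to the following loop. The `train_and_score` callback is a hypothetical placeholder for training on the listed areas and returning a metric (such as average CLL or AUC) on the held-out area.

```python
from statistics import mean
from typing import Callable, Iterator, List, Tuple

AREAS = ["ai", "graphics", "languages", "systems", "theory"]


def leave_one_area_out(areas: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training areas, held-out test area) for each area in turn."""
    for held_out in areas:
        train = [a for a in areas if a != held_out]
        yield train, held_out


def cross_validate(areas: List[str],
                   train_and_score: Callable[[List[str], str], float]) -> float:
    """Average the metric returned by train_and_score over all folds."""
    return mean(train_and_score(train, test)
                for train, test in leave_one_area_out(areas))


for train, test in leave_one_area_out(AREAS):
    print("train on", train, "test on", test)
```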

Results
- MLN(SLS), MLN(SLB) better than MLN(CL), MLN(KB), CL, KB, NB, BN
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(SLS), MLN(SLB) better than MLN(CL), MLN(KB), CL, KB, NB, BN
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(SLB+KB) better than MLN(KB+CL), KB+CL
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(SLB+KB) better than MLN(KB+CL), KB+CL
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(X) does better than the corresponding X alone
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(X) does better than the corresponding X alone
- [Figure: bar charts of negated CLL, "CLL (-ve)", and AUC per system]

Results
- MLN(SLS) on UW-CSE; cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines
  - With speedups: 5.3 hrs
  - Without speedups: didn't finish running in 24 hrs
- MLN(SLB) on UW-CSE; on a single 2.8 GHz Pentium 4 machine
  - With speedups: 8.8 hrs
  - Without speedups: 13.7 hrs

Future Work
- Speeding up counting of the # of true groundings of a clause
- Probabilistically bounding the loss in accuracy due to subsampling
- Probabilistic predicate discovery

Conclusion
- Developed an algorithm that:
  - Learns FO clauses by directly optimizing pseudo-likelihood
  - Is fast enough to be practical
  - Learns better structure than R&D, pure ILP, purely probabilistic and purely KB approaches

