General Graphical Model Learning Schema (after Kimmig et al. 2014)
Initialize graph G := empty.
While not converged:
  Generate candidate graphs.
  For each candidate graph C, learn parameters θ_C that maximize score(C, θ, dataset).
  G := argmax_C score(C, θ_C, dataset).
  Check convergence criterion.
For relational data, the score function must be a relational score.
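The schema can be sketched as a generic hill-climbing loop. This is only a sketch: `candidates`, `learn_parameters`, and `score` are placeholders for whatever neighbour generator, parameter estimator, and (relational) score a concrete system supplies.

```python
# Sketch of the generic score-based learning loop (after Kimmig et al. 2014).
# "candidates", "learn_parameters", and "score" are hypothetical callables,
# not part of any particular library.

def learn_structure(dataset, candidates, learn_parameters, score, max_iters=100):
    G = frozenset()          # start with the empty graph (no edges)
    best = score(G, learn_parameters(G, dataset), dataset)
    for _ in range(max_iters):
        improved = False
        for C in candidates(G):
            theta_C = learn_parameters(C, dataset)   # maximize score over theta
            s = score(C, theta_C, dataset)
            if s > best:
                G, best, improved = C, s, True
        if not improved:      # convergence: no candidate improves the score
            break
    return G, best
```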
Tutorial on Learning Bayesian Networks for Complex Relational Data. Section 3: Parameter Learning
Overview: Upgrading Parameter Learning
Extend learning concepts/algorithms designed for i.i.d. data to relational data. This is called upgrading i.i.d. learning (Van Laer and De Raedt).
Score/objective function: the random selection likelihood.
Algorithm: the fast Möbius transform.
Van Laer, W. & De Raedt, L. (2001), 'How to upgrade propositional learners to first-order logic: A case study', in 'Relational Data Mining', Springer Verlag.
Likelihood Function for IID Data
Score-based Learning for IID Data
Most Bayesian network learning methods are based on a score function. The score function measures how well the network fits the observed data.
Key component: the likelihood function, which measures how likely each data point is according to the Bayesian network; intuitively, how well the model explains each data point.
(Diagram: Bayesian network + data table -> log-likelihood, e.g. -3.5.)
The Bayes Net Likelihood Function for IID Data
For each row, compute the log-likelihood of the attribute values in the row.
Log-likelihood for the table = sum of the log-likelihoods of its rows.
Under random selection semantics and the instantiation principle, this is the special case with only one first-order variable.
IID Example
Bayesian network CP-tables: Drama(Movie) <- Action(Movie) -> Horror(Movie), with
P(Action(M.)=T) = 1
P(Drama(M.)=T | Action(M.)=T) = 1/2
P(Horror(M.)=F | ...) = 1
Data table and likelihood computation:
Title | Drama | Action | Horror | P_B | ln(P_B)
Fargo | T | T | F | 1 x 1/2 x 1 = 1/2 | -0.69
Kill_Bill | F | T | F | 1 x 1/2 x 1 = 1/2 | -0.69
In this toy data table, Action is always true and Horror is always false. P_B is the joint probability from the Bayes net, i.e. the product of conditional probabilities from the CP-tables.
Total log-likelihood score for the table = -1.38.
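The row-by-row computation can be sketched in Python. The CP-table values are the ones from the toy example; note that the exact total is 2 x ln(1/2) = -1.386, which the slide shows as -1.38 by rounding each row to -0.69 first.

```python
from math import log

# CP-table values from the toy example: Action(M)=T always,
# Drama(M)=T half the time given Action(M)=T, Horror(M)=F always.
p_action_T = 1.0
p_drama_T = 0.5        # P(Drama(M)=T | Action(M)=T)
p_horror_F = 1.0       # P(Horror(M)=F | ...)

rows = [
    {"Title": "Fargo",     "Drama": True,  "Action": True, "Horror": False},
    {"Title": "Kill_Bill", "Drama": False, "Action": True, "Horror": False},
]

def row_loglik(row):
    # P_B for one row = product of the conditional probabilities.
    p = p_action_T
    p *= p_drama_T if row["Drama"] else 1 - p_drama_T
    p *= p_horror_F if not row["Horror"] else 1 - p_horror_F
    return log(p)

# Log-likelihood for the table = sum over rows.
table_loglik = sum(row_loglik(r) for r in rows)
print(round(table_loglik, 2))   # -1.39 (slide's -1.38 rounds each row to -0.69)
```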
Likelihood Function for Relational Data
Wanted: A Likelihood Score for Relational Data
(Diagram: database + Bayesian network -> log-likelihood, e.g. -3.5.)
Problems: multiple tables; dependent data points.
The likelihood score is not necessarily normalized: likelihood function = likelihood score / normalization constant (partition function).
The Random Selection Likelihood Score
1. Randomly select a grounding/instantiation for all first-order variables in the first-order Bayesian network.
2. Compute the log-likelihood for the attributes of the selected grounding.
3. Log-likelihood score = expected log-likelihood of a random grounding.
Generalizes the IID log-likelihood, but without the independence assumption.
Schulte, O. (2011), 'A tractable pseudo-likelihood function for Bayes Nets applied to relational data', in 'SIAM SDM', pp. 462-473.
Example
Bayesian network: gender(A) -> ActsIn(A,M), with CP-tables
P(g(A)=M) = 1/2
P(ActsIn(A,M)=T | g(A)=M) = 1/4
P(ActsIn(A,M)=T | g(A)=W) = 2/4
Random selection over actor-movie groundings (probability 1/8 each):
Prob | A | M | gender(A) | ActsIn(A,M) | P_B | ln(P_B)
1/8 | Brad_Pitt | Fargo | M | F | 3/8 | -0.98
1/8 | Brad_Pitt | Kill_Bill | M | F | 3/8 | -0.98
1/8 | Lucy_Liu | Fargo | W | F | 2/8 | -1.39
1/8 | Lucy_Liu | Kill_Bill | W | T | 2/8 | -1.39
1/8 | Steve_Buscemi | Fargo | M | T | 1/8 | -2.08
1/8 | Steve_Buscemi | Kill_Bill | M | F | 3/8 | -0.98
1/8 | Uma_Thurman | Fargo | W | F | 2/8 | -1.39
1/8 | Uma_Thurman | Kill_Bill | W | T | 2/8 | -1.39
Random selection log-likelihood = -1.32 (arithmetic mean of ln(P_B)), i.e. 0.27 as a geometric mean.
Data + Bayesian network -> random selection likelihood value.
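A minimal sketch of this computation: enumerate all actor-movie groundings, average the log-likelihoods, and recover the arithmetic (-1.32) and geometric (0.27) means. The data and CP-table parameters are the ones shown in the example.

```python
from math import log, exp

# Toy domain from the example slide.
actors = {"Brad_Pitt": "M", "Lucy_Liu": "W", "Steve_Buscemi": "M", "Uma_Thurman": "W"}
movies = ["Fargo", "Kill_Bill"]
acts_in = {("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"),
           ("Uma_Thurman", "Kill_Bill")}

# CP-table parameters from the slide.
p_male = 0.5
p_acts = {"M": 1 / 4, "W": 2 / 4}   # P(ActsIn(A,M)=T | gender(A))

logliks = []
for a, g in actors.items():
    for m in movies:                 # each grounding is equally likely (1/8)
        p = p_male if g == "M" else 1 - p_male
        p *= p_acts[g] if (a, m) in acts_in else 1 - p_acts[g]
        logliks.append(log(p))

# Expected log-likelihood of a random grounding.
score = sum(logliks) / len(logliks)
print(round(score, 2), round(exp(score), 2))   # -1.32 0.27
```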
Observed Frequencies Maximize the Random Selection Likelihood
Proposition: The random selection log-likelihood score is maximized by setting the Bayesian network parameters to the observed conditional frequencies.
gender(A) -> ActsIn(A,M)
P(g(A)=M) = 1/2
P(ActsIn(A,M)=T | g(A)=M) = 1/4
P(ActsIn(A,M)=T | g(A)=W) = 2/4
To compute the conditional probability P(ActsIn(A,M)=T | g(A)=M): there are 4 actor-movie pairs where the actor is male (Brad_Pitt x 2 + Steve_Buscemi x 2); of those 4, there is only one where the actor appears in the movie (Buscemi in Fargo).
Schulte, O. (2011), 'A tractable pseudo-likelihood function for Bayes Nets applied to relational data', in 'SIAM SDM', pp. 462-473.
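The proposition can be checked on the toy data: counting observed conditional frequencies recovers exactly the parameter values above. A sketch, using the same hypothetical actor-movie data as the example:

```python
# Estimate the maximizing parameters as observed conditional frequencies,
# using the toy actor-movie data from the example slides.
actors = {"Brad_Pitt": "M", "Lucy_Liu": "W", "Steve_Buscemi": "M", "Uma_Thurman": "W"}
movies = ["Fargo", "Kill_Bill"]
acts_in = {("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"),
           ("Uma_Thurman", "Kill_Bill")}

pairs = [(a, m) for a in actors for m in movies]

# P(g(A)=M): fraction of actors who are male.
p_male = sum(1 for a in actors if actors[a] == "M") / len(actors)

def p_acts_given(gender):
    # P(ActsIn(A,M)=T | g(A)=gender): among actor-movie pairs with that
    # gender, the fraction where the ActsIn link is present.
    g_pairs = [(a, m) for a, m in pairs if actors[a] == gender]
    return sum(1 for p in g_pairs if p in acts_in) / len(g_pairs)

print(p_male, p_acts_given("M"), p_acts_given("W"))   # 0.5 0.25 0.5
```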
Computing Maximum Likelihood Parameter Values
The parameter values that maximize the random selection likelihood.
Computing Relational Frequencies
Need to compute a contingency table with instantiation counts, e.g. (for the running actor-movie example):
g(A) | ActsIn(A,M) | action(M) | count
M | F | T | 3
M | T | T | 1
W | F | T | 2
W | T | T | 2
This is well researched for the case where all relationships are true:
- SQL COUNT(*)
- virtual joins
- partition function reduction
Parametrized polynomial complexity in the number of first-order variables.
Vardi, M. Y. (1995), 'On the Complexity of Bounded-Variable Queries', in 'PODS', ACM Press, pp. 266-276.
Yin, X.; Han, J.; Yang, J. & Yu, P. S. (2004), 'CrossMine: Efficient Classification Across Multiple Database Relations', in 'ICDE'.
Venugopal, D.; Sarkhel, S. & Gogate, V. (2015), 'Just Count the Satisfied Groundings: Scalable Local-Search and Sampling Based Inference in MLNs', in 'AAAI', pp. 3606-3612.
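For the all-relationships-true case, the counts can be obtained with plain SQL, as the slide notes. A minimal sqlite3 sketch on the running actor-movie example; the table and column names are illustrative:

```python
import sqlite3

# Build a tiny in-memory relational database (illustrative schema).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Actor(name TEXT, gender TEXT)")
cur.execute("CREATE TABLE Movie(title TEXT, action TEXT)")
cur.execute("CREATE TABLE ActsIn(name TEXT, title TEXT)")
cur.executemany("INSERT INTO Actor VALUES (?,?)",
                [("Lucy_Liu", "W"), ("Steve_Buscemi", "M"), ("Uma_Thurman", "W")])
cur.executemany("INSERT INTO Movie VALUES (?,?)",
                [("Fargo", "T"), ("Kill_Bill", "T")])
cur.executemany("INSERT INTO ActsIn VALUES (?,?)",
                [("Steve_Buscemi", "Fargo"), ("Lucy_Liu", "Kill_Bill"),
                 ("Uma_Thurman", "Kill_Bill")])

# Contingency table over g(A), action(M), restricted to ActsIn(A,M) = T:
# join the link table with the entity tables and GROUP BY the attributes.
rows = cur.execute("""
    SELECT a.gender, m.action, COUNT(*)
    FROM ActsIn r JOIN Actor a ON r.name = a.name
                  JOIN Movie m ON r.title = m.title
    GROUP BY a.gender, m.action
""").fetchall()
print(sorted(rows))   # [('M', 'T', 1), ('W', 'T', 2)]
```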
Single Relation Case
For a single relation, compute P_D(R = F) using the 1-minus trick (Getoor et al. 2003).
Example:
P_D(HasRated(User,Movie) = T) = 4.27%
P_D(HasRated(User,Movie) = F) = 95.73%
How to generalize to multiple relations? E.g. P_D(ActsIn(Actor,Movie)=F, HasRated(User,Movie)=F).
Getoor, L.; Friedman, N.; Koller, D. & Taskar, B. (2003), 'Learning probabilistic models of link structure', J. Mach. Learn. Res. 3, 679-707.
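The 1-minus trick needs only the count of true links; the false-link frequency follows by subtraction, so the (huge) complement table is never materialized. A sketch with illustrative domain sizes chosen so the frequencies match the slide's 4.27% / 95.73% (the real HasRated counts are not given here):

```python
# Illustrative domain sizes, not from the slide; chosen so that
# P_D(HasRated = T) comes out to 4.27%.
n_users, n_movies = 100, 1000
n_rated = 4_270                    # number of true HasRated(User, Movie) links

# P_D(R=T) = (# true links) / (# possible links); P_D(R=F) by the 1-minus trick.
p_true = n_rated / (n_users * n_movies)
p_false = 1 - p_true               # no complement table needed
```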
The Möbius Extension Theorem for Negated Relations
For two link types R1, R2:
Joint probabilities (with negation): P(R1=T, R2=T) = p4, P(R1=T, R2=F) = p3, P(R1=F, R2=T) = p2, P(R1=F, R2=F) = p1.
Möbius parameters (no negation): P(R1=T, R2=T) = q4, P(R1=T) = q3, P(R2=T) = q2, P(nothing) = q1 = 1.
The Möbius parameters determine the joint probabilities, and vice versa.
The Fast Inverse Möbius Transform
Example with R1 = ActsIn(A,M), R2 = HasRated(U,M). The numbers are made up, attribute conditions are ignored, and * means "nothing specified". J.P. = joint probability.
Initial table with no false relationships (Möbius parameters):
R1 | R2 | J.P.
T | T | 0.2
T | * | 0.3
* | T | 0.4
* | * | 1
After the R2 step, each * in R2 is replaced by F via subtraction (e.g. 0.3 - 0.2 = 0.1):
R1 | R2 | J.P.
T | T | 0.2
T | F | 0.1
* | T | 0.4
* | F | 0.6
After the R1 step, the table holds the joint probabilities:
R1 | R2 | J.P.
T | T | 0.2
T | F | 0.1
F | T | 0.2
F | F | 0.5
Exercise: trace the method.
Kennes, R. & Smets, P. (1990), 'Computational aspects of the Möbius transformation', in 'UAI', pp. 401-416.
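A sketch of the inverse transform for any number of relationship variables: repeatedly replace a '*' ("nothing specified") entry by a false entry using the subtraction P(F, rest) = P(*, rest) - P(T, rest). The function name and table encoding are illustrative; the numbers reproduce the slide's made-up example.

```python
def inverse_mobius(mobius, n_rels):
    """Turn Moebius parameters (probabilities of positive conjunctions,
    keyed by tuples over {'T', '*'}) into the full joint probabilities
    over {'T', 'F'} assignments to n_rels relationship variables."""
    table = dict(mobius)
    for i in range(n_rels):                     # eliminate '*' one relation at a time
        new = {}
        for key, p in table.items():
            if key[i] == '*':
                t_key = key[:i] + ('T',) + key[i + 1:]
                f_key = key[:i] + ('F',) + key[i + 1:]
                new[f_key] = p - table[t_key]   # P(F, rest) = P(*, rest) - P(T, rest)
            else:
                new[key] = p
        table = new
    return table

# Made-up numbers from the slide (attribute conditions ignored).
mobius = {('T', 'T'): 0.2, ('T', '*'): 0.3, ('*', 'T'): 0.4, ('*', '*'): 1.0}
joint = inverse_mobius(mobius, 2)
print({k: round(v, 3) for k, v in joint.items()})
# {('T', 'T'): 0.2, ('T', 'F'): 0.1, ('F', 'T'): 0.2, ('F', 'F'): 0.5}
```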
Parameter Learning Time
Fast inverse Möbius transform (IMT) vs. constructing complement tables using SQL (times in seconds): the Möbius transform is much faster, by a factor of 15-200.
Using Presence and Absence of Relationships
Find correlations between links/relationships, not just between attributes given links. E.g., if a user performs a web search for an item, is it likely that the user watches a movie about the item?
Example of a Weka-interesting association rule on the Financial benchmark dataset: statement_frequency(Account) = monthly, HasLoan(Account, Loan) = true.
Qian, Z.; Schulte, O. & Sun, Y. (2014), 'Computing Multi-Relational Sufficient Statistics for Large Databases', in 'Conference on Information and Knowledge Management (CIKM)', pp. 1249-1258.
Summary
- Random selection semantics yields the random selection log-likelihood.
- The maximizing values for the random selection log-likelihood are the observed empirical frequencies. This generalizes the maximum likelihood result for IID data.
- The fast Möbius transform computes database frequencies for conjunctive formulas involving any number of negative relationships.
- This enables link analysis: modelling probabilistic associations that involve the presence or absence of relationships.