Machine Learning of Bayesian Networks Using Constraint Programming


Machine Learning of Bayesian Networks Using Constraint Programming
Peter van Beek and Hella-Franziska Hoffmann
University of Waterloo

Bayesian networks

A probabilistic, directed, acyclic graphical model:
- nodes are random variables
- directed arcs connect pairs of nodes; intuitively, an arc X → Y means X has a direct influence on Y
- each node has a conditional probability table (CPT) that specifies the effects of the parents on the node

Diverse applications: knowledge discovery, classification, prediction, and control.
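To make the factorization behind the definition concrete, here is a toy two-node network in Python. The variable names echo the diabetes example; all probabilities are made up for illustration, and this is a sketch rather than any library's API.

```python
# Two-node network Exercise -> Overweight with hypothetical CPTs.
# The joint distribution factorizes into one CPT entry per node,
# each conditioned on that node's parents.
p_exercise = {"yes": 0.4, "no": 0.6}
p_overweight_given_exercise = {
    "yes": {"yes": 0.2, "no": 0.8},  # P(Overweight | Exercise=yes)
    "no":  {"yes": 0.5, "no": 0.5},  # P(Overweight | Exercise=no)
}

def joint(exercise, overweight):
    """P(Exercise=e, Overweight=o) = P(e) * P(o | e)."""
    return p_exercise[exercise] * p_overweight_given_exercise[exercise][overweight]

# The four joint probabilities form a proper distribution.
total = sum(joint(e, o) for e in ("yes", "no") for o in ("yes", "no"))
```

Multiplying the CPT entries along the arcs recovers the full joint distribution, which is exactly what the DAG structure encodes.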

Example: Medical diagnosis of diabetes

[Figure: a three-layer network. Patient information & root causes: Gender, Exercise, Heredity, Pregnancies, Age, Overweight. Medical difficulties & diseases: Diabetes. Diagnostic tests & symptoms: BMI, Glucose conc., Serum test, Fatigue, Diastolic BP.]

Structure learning from data: the score-and-search approach

A scoring function (BIC/MDL, BDeu) assigns a score to each possible parent set of each random variable.

Combinatorial optimization problem: find a directed acyclic graph (DAG) over the random variables that minimizes the total score.

[Figure: a data table over Gender, Exercise, Age, Diastolic BP, …, Diabetes, together with candidate parent sets for one variable and example scores 17.5, 20.2, and 19.3.]
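The objective of the optimization problem can be sketched in a few lines: given precomputed parent-set scores, the cost of a candidate DAG is simply the sum of the scores of the chosen parent sets, to be minimized subject to acyclicity. All scores and variable names below are illustrative, not values from the talk's benchmarks.

```python
# Hypothetical scored parent-set lists: for each variable, a map from
# candidate parent set (a frozenset) to its precomputed score.
scores = {
    "Exercise": {frozenset(): 17.5, frozenset({"Age"}): 20.2},
    "Age":      {frozenset(): 19.3},
    "Gender":   {frozenset(): 12.0, frozenset({"Exercise", "Age"}): 10.1},
}

def total_score(assignment):
    """Sum the score of the chosen parent set for every variable."""
    return sum(scores[v][p] for v, p in assignment.items())

# One candidate DAG: Exercise and Age are roots, Gender has both as parents.
dag = {"Exercise": frozenset(), "Age": frozenset(),
       "Gender": frozenset({"Exercise", "Age"})}
```

The search component must then explore such assignments while ruling out cyclic ones, which is where the constraint model below comes in.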

Related work: Global search algorithms

Dynamic programming:
- Koivisto & Sood, JMLR, 2004
- Silander & Myllymäki, UAI, 2006
- Malone, Yuan & Hansen, AAAI, 2011

Integer linear programming:
- Jaakkola et al., AISTATS, 2010
- Bartlett & Cussens, UAI, 2013

A* search:
- Yuan & Malone, JAIR, 2013
- Fan, Malone & Yuan, UAI, 2014
- Fan & Yuan, AAAI, 2015

Breadth-first branch-and-bound search:
- de Campos & Ji, JMLR, 2011
- Fan, Yuan & Malone, AAAI, 2014

Depth-first branch-and-bound search:
- Tian, UAI, 2000
- Malone & Yuan, LNCS 8323, 2014

Constraint model (I)

Notation:
- V: the set of random variables
- n: the number of random variables in the data set
- cost(v): the cost (score) of variable v
- dom(v): the domain of variable v

Vertex (possible parent set) variables: v1, …, vn
- dom(vi) ⊆ 2^V consists of the possible parent sets for vi
- the assignment vi = p denotes that vertex vi has parents p in the graph
- global constraint: acyclic(v1, …, vn), satisfied iff the graph designated by the parent sets is acyclic

Constraint model (II)

Ordering (permutation) variables: o1, …, on
- dom(oi) = {1, …, n}
- the assignment oi = j denotes that vertex vj is in position i in the total ordering
- global constraint: alldifferent(o1, …, on)
- given a permutation, it is easy to determine the minimum cost DAG

Depth (auxiliary) variables: d1, …, dn
- dom(di) = {0, …, n−1}
- the assignment di = k denotes that the depth of the vertex variable that occurs at position i in the ordering is k

Channeling constraints connect the three types of variables.
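The remark that a permutation makes the minimum cost DAG easy to determine can be sketched as follows: scanning the ordering left to right, each variable independently takes its cheapest candidate parent set drawn entirely from its predecessors, and the result is acyclic by construction. This is an illustrative sketch (assuming every domain contains the empty parent set, so a choice always exists), not the paper's implementation.

```python
def min_cost_for_ordering(order, domains):
    """Cheapest DAG consistent with a total ordering of the variables:
    each variable picks its lowest-cost parent set whose members all
    precede it in the ordering."""
    cost = 0.0
    preds = set()
    for v in order:
        cost += min(c for p, c in domains[v].items() if p <= preds)
        preds.add(v)
    return cost

# Hypothetical domains: parent set -> score.
domains = {
    "A": {frozenset(): 5.0},
    "B": {frozenset(): 4.0, frozenset({"A"}): 2.0},
    "C": {frozenset(): 6.0, frozenset({"A", "B"}): 1.0},
}
best = min_cost_for_ordering(["A", "B", "C"], domains)
```

Because every ordering determines its best DAG this cheaply, branching over the ordering variables (rather than directly over parent sets) is an attractive search strategy.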

Symmetry-breaking constraints (I)

Many permutations, and prefixes of permutations, are symmetric: they lead to the same minimum cost DAG. Rule out all but the lexicographically least using the depth variables:

    d1 = 0
    di = k ↔ (di+1 = k ∨ di+1 = k+1),   i = 1, …, n−1
    di = di+1 → oi < oi+1

Example: the ordering Exercise, Gender, Age is allowed; Gender, Age, Exercise is disallowed.

[Figure: a three-vertex DAG over Gender, Exercise, and Age.]

Symmetry-breaking constraints (II)

Identify interchangeable vertex variables:
- identified prior to search
- same domains and costs (after substitution)
- substitutable in the domains of other variables

Break the symmetry using a lexicographic ordering.

Symmetry-breaking constraints (III)

I-equivalent networks: two DAGs are said to be I-equivalent if they encode the same set of conditional independence assumptions.

Chickering (1995, 2002) provides a local characterization: a sequence of "covered" edges that can be reversed.

[Figure: two I-equivalent three-vertex DAGs over Gender, Exercise, and Age that differ in the direction of a covered edge.]

Dominance constraints (I)

Consider an instantiation of the ordering prefix o1, …, oi. A value p ∈ dom(vj) is consistent with the ordering if each element of p occurs in the ordering. We want the lowest cost p that is consistent with the ordering; all other p′ ∈ dom(vj) of higher cost can safely be pruned away.
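A minimal sketch of this pruning rule: once some parent set p lies entirely inside the instantiated prefix, p can always replace any costlier parent set without creating a cycle (vj comes after the prefix), so every strictly more expensive value can be dropped. Function and variable names are illustrative.

```python
def prune_dominated(prefix, domain):
    """Keep only parent sets no more expensive than the cheapest one
    whose elements all occur in the instantiated ordering prefix."""
    placed = set(prefix)
    consistent = [c for p, c in domain.items() if p <= placed]
    if not consistent:
        return dict(domain)  # no consistent value yet: nothing to prune
    best = min(consistent)
    return {p: c for p, c in domain.items() if c <= best}

# Hypothetical domain for one vertex variable.
domain = {
    frozenset(): 19.3,
    frozenset({"Exercise"}): 17.5,
    frozenset({"Age"}): 16.0,
    frozenset({"Exercise", "Age"}): 21.0,
}
pruned = prune_dominated(["Exercise"], domain)
```

Here {Exercise} (17.5) is consistent with the prefix, so the empty set (19.3) and {Exercise, Age} (21.0) are pruned, while the cheaper {Age} (16.0) survives.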

Dominance constraints (II)

Teyssier and Koller (2005) present a cost-based pruning rule:
- only applicable before search begins
- routinely used in score-and-search approaches

We generalize the pruning rule:
- applicable during search
- takes into account the ordering information induced by the partial solution so far

[Figure: candidate parent sets over Exercise and Age for Gender, with example scores 17.5 and 19.3.]

Dominance constraints (III)

Consider an instantiation of the ordering prefix o1, …, oi, and let π be a permutation over {1, …, i}. The cost of completing the ordering prefixes o1, …, oi and oπ(1), …, oπ(i) is identical; this observation is the basis of the dynamic programming, A*, and best-first approaches.

Hence, any ordering prefix o1, …, oi can be safely pruned if there exists a permutation π such that cost(oπ(1), …, oπ(i)) < cost(o1, …, oi).
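One way to operationalize this rule during search is to memoize, per *set* of prefix variables, the cheapest prefix cost seen so far, and prune any prefix that arrives at the same set more expensively. This is an illustrative sketch with invented names, not CPBayes's actual data structure.

```python
# Cheapest cost seen so far for each set of prefix variables.
best_prefix_cost = {}

def dominated(prefix_vars, cost):
    """Return True if some permutation of this prefix was already reached
    more cheaply; otherwise record this cost and keep searching."""
    key = frozenset(prefix_vars)  # permutations share the same key
    if key in best_prefix_cost and best_prefix_cost[key] < cost:
        return True
    best_prefix_cost[key] = min(cost, best_prefix_cost.get(key, cost))
    return False
```

Storing every visited set is essentially the memory cost that dynamic programming and A* pay; a branch-and-bound search can bound the table size, trading pruning power for space.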

Acyclic constraint: acyclic(v1, …, vn)

Algorithm for checking satisfiability, based on a well-known property of DAGs: a graph over vertices V is acyclic iff every non-empty subset S ⊆ V contains at least one vertex w ∈ S whose parents all lie outside of S.

- Satisfiability can be tested in O(n²d) steps, where n is the number of vertices and d is an upper bound on the number of possible parent sets per vertex.
- Generalized arc consistency can be enforced in O(n³d²) steps.
- Speedup: prune based on identifying necessary arcs.
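The satisfiability test suggested by this property can be sketched directly: repeatedly remove any vertex that has at least one candidate parent set lying entirely inside the already-removed vertices; the constraint is satisfiable iff all vertices can be removed. A sketch under the stated property, with at most n passes over n vertices and their d candidate sets (the O(n²d) bound from the slide, up to the cost of the subset tests):

```python
def acyclic_satisfiable(domains):
    """domains: map from vertex to an iterable of candidate parent sets
    (frozensets). Returns True iff some choice of one parent set per
    vertex yields an acyclic graph."""
    removed = set()
    remaining = set(domains)
    while remaining:
        progress = False
        for v in list(remaining):
            # v can be placed next if some candidate parent set is
            # contained in the vertices already removed.
            if any(p <= removed for p in domains[v]):
                removed.add(v)
                remaining.discard(v)
                progress = True
        if not progress:
            return False  # every remaining vertex needs a parent inside S
    return True
```

Enforcing generalized arc consistency goes further: it must decide, per candidate parent set, whether that value participates in *some* acyclic solution, hence the extra factor in the O(n³d²) bound.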

Solving the constraint model

Constraint-based depth-first branch-and-bound search:
- branches over the ordering variables using the static order o1, …, on
- cost function: z = cost(v1) + … + cost(vn)
- lower bound based on Fan and Yuan (2015), using pattern databases
- initial upper bound based on Teyssier and Koller (2005), using local search
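The overall search loop can be illustrated with a toy depth-first branch-and-bound over orderings. Unlike CPBayes, this sketch uses only the accumulated cost as its lower bound (no pattern databases, constraint propagation, or symmetry breaking), and it assumes every domain contains the empty parent set; all names are illustrative.

```python
def branch_and_bound(domains):
    """Toy depth-first branch-and-bound over variable orderings.
    domains: map from variable to {parent set (frozenset): score}."""
    variables = list(domains)
    best = {"cost": float("inf"), "order": None}

    def dfs(prefix, cost):
        if cost >= best["cost"]:
            return  # bound: this prefix is already no better than the incumbent
        if len(prefix) == len(variables):
            best["cost"], best["order"] = cost, list(prefix)
            return
        placed = set(prefix)
        for v in variables:
            if v in placed:
                continue
            # Cheapest parent set for v drawn from the prefix so far.
            step = min(c for p, c in domains[v].items() if p <= placed)
            prefix.append(v)
            dfs(prefix, cost + step)
            prefix.pop()

    dfs([], 0.0)
    return best["cost"], best["order"]

# Hypothetical two-variable instance: placing B first lets A use its
# cheap parent set {B}.
example = {
    "A": {frozenset(): 5.0, frozenset({"B"}): 1.0},
    "B": {frozenset(): 4.0},
}
best_cost, best_order = branch_and_bound(example)
```

The real solver sharpens both sides of the comparison in the bound: a pattern-database lower bound on the remaining cost, and a local-search incumbent computed before search starts.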

Experimental results: BDeu scoring

Time (sec.) to determine the minimal cost BN, where n is the number of random variables, N is the number of instances in the data set, and d is the total number of possible parent sets for the random variables. Time limit of 24 hours (OT); memory limit of 16 GB (OM).

benchmark | n  | N      | d       | GOBNILP v1.4.1 | A* v2015 | CPBayes v1.0
shuttle   | 10 | 58,000 | 812     | 58.5           | 0.0      | …
letter    | 17 | 20,000 | 18,841  | 5,060.8        | 1.3      | 1.4
zoo       | …  | 101    | 2,855   | 177.7          | 0.5      | 0.2
vehicle   | 19 | 846    | 3,121   | 90.4           | 2.4      | 0.7
segment   | 20 | 2,310  | 6,491   | 2,486.5        | 3.3      | …
mushroom  | 23 | 8,124  | 438,185 | OT             | 255.5    | 561.8
autos     | 26 | 159    | 25,238  | 918.3          | 464.2    | …
insurance | 27 | 1,000  | 792     | 2.8            | 583.9    | 107.0
steel     | 28 | 1,941  | 113,118 | 902.9          | 21,547.0 | …
flag      | 29 | 194    | 1,324   | 28.0           | 49.4     | 39.9
wdbc      | 31 | 569    | 13,473  | 2,055.6        | OM       | 11,031.6

Experimental results: BIC scoring

Time (sec.) to determine the minimal cost BN, where n is the number of random variables, N is the number of instances in the data set, and d is the total number of possible parent sets for the random variables. Time limit of 24 hours (OT); memory limit of 16 GB (OM).

benchmark   | n  | N      | d      | GOBNILP v1.4.1 | A* v2015 | CPBayes v1.0
letter      | 17 | 20,000 | 4,443  | 72.5           | 0.6      | 0.2
mushroom    | 23 | 8,124  | 13,025 | 82,736.2       | 34.4     | 7.7
autos       | 26 | 159    | 2,391  | 108.0          | 316.3    | 50.8
insurance   | 27 | 1,000  | 506    | 2.1            | 824.3    | 103.7
steel       | 28 | 1,941  | 93,026 | OT             | 550.8    | 4,447.6
wdbc        | 31 | 569    | 14,613 | 1,773.7        | 1,330.8  | 1,460.5
soybean     | 36 | 266    | 5,926  | 789.5          | 1,114.1  | 147.8
spectf      | 45 | 267    | 610    | 8.4            | 401.7    | 11.2
sponge      | …  | 76     | 618    | 4.1            | 793.5    | 13.2
hailfinder  | 56 | 500    | 418    | 0.5            | OM       | 9.3
lung cancer | 57 | 32     | 292    | 2.0            | 10.5     | …
carpo       | 60 | …      | 847    | 6.9            | …        | …

Discussion

Bayesian networks can be classified by size:
- small: 20 or fewer random variables
- medium: 20–60
- large: 60–100
- very large: 100–1,000
- massive: more than 1,000

CPBayes effectively trades space for time. Small networks are easy for A* and CPBayes, but can be challenging for GOBNILP. GOBNILP scales somewhat better than CPBayes in the parameter n; CPBayes scales much better than GOBNILP in the parameter d. No current score-and-search method scales beyond medium instances.

Future work

Improve the branch-and-bound search:
- better lower and upper bounds
- exploit decomposition and caching during the search

All current approaches assume complete data; an important next step is to handle missing values and latent variables.