A Distributed Bucket Elimination Algorithm


A Distributed Bucket Elimination Algorithm
Andrew Gelfand, William Lam
CS 230 Project – March 15, 2011

Motivation
Bucket Elimination (BE) is a common and simple algorithm for exact inference, but elimination is exponential in the induced width.
Ex: consider a Bayesian Network (BN) with binary variables (k=2) and induced width w* = 20. The largest table has 2^19 entries and requires only 4 MB of memory. With k=3, it needs roughly 9 GB of memory!
BE is efficient but not scalable – it either solves a problem rapidly or runs out of memory.
=> The real bottleneck of exact elimination algorithms is space.
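The memory figures above follow directly from counting table entries. A small helper makes the arithmetic explicit (the 8 bytes per entry assumes double-precision floats, which the slides do not state):

```python
# Memory needed for a table over n variables in bucket elimination:
# k^n entries (k = domain size), times bytes per entry.

def table_bytes(k: int, n: int, bytes_per_entry: int = 8) -> int:
    """Size in bytes of a table over n variables with domain size k."""
    return (k ** n) * bytes_per_entry

# Binary variables: a table with 2^19 entries is exactly 4 MiB.
print(table_bytes(2, 19) / 2**20)   # 4.0 (MiB)
# Ternary variables: the same-shaped table is already ~8.7 GiB.
print(table_bytes(3, 19) / 2**30)
```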

Idea
Extend available memory and processing power by using multiple machines.
Framework: Message Passing Interface (MPI), master/worker paradigm.
Not without cost: communication overhead.
Challenge: perform BE to compute Pr(e) on a distributed system while minimizing communication overhead.
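The master/worker paradigm the slide names can be sketched without MPI: the master splits a bucket's work into jobs and farms them out, then reduces the partial results. Here threads stand in for MPI worker ranks, and summing a chunk of integers is a placeholder for a block computation; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Placeholder for the real per-block work a worker node would do.
    return sum(chunk)

# Master side: build jobs, dispatch them, and reduce the partial results.
jobs = [list(range(i, i + 5)) for i in range(0, 20, 5)]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(worker, jobs))
total = sum(partials)
print(total)   # 190
```

With MPI the dispatch and reduction would become explicit sends/receives, which is exactly where the communication overhead mentioned above appears.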

Bucket Elimination [Dechter96]
[Figure: bucket tree for the query Pr(a|e=0) under elimination order d,e,b,c, with buckets D, E, B, C, A, the original functions assigned to each bucket, and messages passed along the tree edges]
BE is a unifying algorithmic framework for probabilistic inference that organizes computation using 'buckets'. A bucket is an organizational device that contains a set of functions – either the original functions/CPTs or functions generated by the algorithm.
Say we want to compute Pr(a|e=0) given elimination order d,e,b,c. Using BE, we proceed as follows: 1) partition the original functions/CPTs into buckets using the specified elimination order; 2) process buckets from top to bottom, eliminating each bucket's variable from subsequent computations.
BE is also a special case of tree elimination, in which the tree structure along which messages are passed – the bucket tree – is determined by the variable elimination order. The nodes of the tree are the buckets, and BE processes the bucket tree from leaves to root, performing two steps at each bucket: 1) combination and 2) elimination.
Time/space is O(exp(w*)).
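The combination and elimination steps above can be sketched in a few lines when everything fits in memory. A factor is a pair (variable tuple, ndarray); the tiny two-node network at the end is illustrative and is not the example from the slides:

```python
import numpy as np

def align(f, vs):
    """Broadcast factor f = (vars, array) onto the joint variable order vs."""
    fv, fa = f
    perm = sorted(range(len(fv)), key=lambda i: vs.index(fv[i]))
    fa = fa.transpose(perm)
    ordered = [fv[i] for i in perm]
    dims = iter(fa.shape)
    return fa.reshape([next(dims) if v in ordered else 1 for v in vs])

def combine(factors):
    """Combination step: multiply all functions in a bucket."""
    vs = list(dict.fromkeys(v for fv, _ in factors for v in fv))
    arr = np.ones([1] * len(vs))
    for f in factors:
        arr = arr * align(f, vs)
    return tuple(vs), arr

def eliminate(f, var):
    """Elimination step: sum the bucket's variable out."""
    fv, fa = f
    return tuple(v for v in fv if v != var), fa.sum(axis=fv.index(var))

def bucket_elimination(factors, order):
    for var in order:                      # one bucket per variable
        bucket = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        factors = rest + [eliminate(combine(bucket), var)]
    return combine(factors)

# Two binary variables A -> B; summing out A leaves Pr(B).
pA = (('A',), np.array([0.6, 0.4]))
pBgA = (('A', 'B'), np.array([[0.9, 0.1], [0.2, 0.8]]))
vs, arr = bucket_elimination([pA, pBgA], ['A'])
print(vs, arr)
```

The message created by eliminate is exactly what gets passed along the bucket-tree edge to the next bucket.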

Processing a Bucket
[Figure: combining f1 and f2 and summing out X20, with the tables too large to fit in RAM; second panel shows f1, f2 and f3 split into blocks f1,1, f2,1, f3,1, … spread over worker nodes]
Further illustration of the combination and elimination steps: function f1 is over variables X1…X20, function f2 is over variables X20…X30; combining them and eliminating X20 yields function f3.
A problem occurs when the intermediate functions do not fit into memory. Note that this is true for BE, while other algorithms need only store tables of separator size.
Strategy: decompose each function into blocks and compute piecemeal on m worker nodes. For example, worker 1 computes Σ_X20 f1,1 × f2,1 = f3,1 while worker 2 computes Σ_X20 f1,1 × f2,2 = f3,2.
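Block-wise processing of a bucket gives the same answer as the monolithic computation. In the sketch below the tiny arrays are stand-ins for f1 and f2 (with the non-shared variables flattened into one axis each), and summing out the shared variable X20 becomes a matrix product; shapes and the 2-block split are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random((4, 2))   # axes: (X1..X19 flattened, X20)
f2 = rng.random((2, 6))   # axes: (X20, X21..X30 flattened)

# Monolithic: sum over the shared variable X20 is a matrix product.
f3 = f1 @ f2

# Block-wise: split f2's non-eliminated axis; each worker holds one
# block of f2 and the relevant part of f1, producing one block of f3.
blocks = np.array_split(f2, 2, axis=1)
f3_blocks = [f1 @ b for b in blocks]    # one job per block pair
f3_assembled = np.concatenate(f3_blocks, axis=1)
print(np.allclose(f3_assembled, f3))   # True
```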

Algorithm Design Issues
Task 1: Function Table Decomposition – how should function tables be decomposed?
Task 2: Job Scheduling – in what order should blocks be processed?

Task 1: Function Table Decomposition
The variable order dictates the location of entries within a table/block: all of a table's entries are enumerated and assigned an index consistent with the order of the variables in the function's scope.
Ex: computing f(X1,X2) = Σ_Y h(Y,X2,X1) needs entries 3, 12 and 21 of h to compute entry 1 of f(X1,X2). If blocks of h are only 8 entries in size, entries 3, 12 and 21 reside in three different blocks, requiring 3 loads/unloads. Worse yet, when we go to compute entry 2 of f(X1,X2) we will have just unloaded block 1 of h, which contains entry 6. This is thrashing at its worst. That is a poor ordering; h(X1,X2,Y) is a good ordering.
Ordering rule:
The variable to be eliminated appears last in the scope of h.
The order of the remaining variables in the scope of h agrees with the scope of f.
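The thrashing argument above can be checked by computing the flat (row-major) indices of the h entries needed for one output entry under both orderings; domain sizes of 3 are assumed here, which reproduces the slide's entries 3, 12, 21:

```python
def flat_index(dims, assignment):
    """Row-major index of an assignment into a table with the given dims."""
    idx = 0
    for d, a in zip(dims, assignment):
        idx = idx * d + a
    return idx

dY = dX = 3  # |dom(Y)| = |dom(X1)| = |dom(X2)| = 3 (assumed)

# Entries of h needed for output entry f(x1=0, x2=1), i.e. entry 1 of f:
good = [flat_index((dX, dX, dY), (0, 1, y)) for y in range(dY)]  # h(X1,X2,Y)
bad = [flat_index((dY, dX, dX), (y, 1, 0)) for y in range(dY)]   # h(Y,X2,X1)
print(good)   # contiguous: [3, 4, 5]
print(bad)    # stride 9:   [3, 12, 21]
```

With the good ordering the needed entries are adjacent, so they fall in a single block; with the bad ordering each one lands in a different 8-entry block.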

Task 2: Job Scheduling
Assign each block i to a processor j; send 'input' blocks as messages.
The computation imposes scheduling constraints. Ex: for the bucket tree above, D > B, E > B, B > C, C > A (a bucket's blocks must be processed before those of its child).
[Figure: example schedule of blocks on Proc 1 and Proc 2]
Formulate as an Integer Program (IP).
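A simple way to see the constraints in action is greedy list scheduling over the precedence order D, E before B, B before C, C before A on two workers. Unit block costs and the greedy earliest-idle-processor rule here are illustrative stand-ins for the IP formulation the slide mentions:

```python
from collections import defaultdict

def schedule(jobs, deps, n_procs):
    """deps[j] = set of jobs that must finish before j starts."""
    indeg = {j: len(deps.get(j, ())) for j in jobs}
    succs = defaultdict(list)
    for j, ds in deps.items():
        for d in ds:
            succs[d].append(j)
    ready = [j for j in jobs if indeg[j] == 0]
    free_at = [0] * n_procs          # when each processor is next idle
    finish, plan = {}, []
    while ready:
        j = ready.pop(0)
        p = min(range(n_procs), key=lambda i: free_at[i])
        start = max([free_at[p]] + [finish[d] for d in deps.get(j, ())])
        finish[j] = start + 1        # unit cost per block
        free_at[p] = finish[j]
        plan.append((j, p, start))   # (job, processor, start time)
        for s in succs[j]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return plan

plan = schedule(list("DEBCA"), {"B": {"D", "E"}, "C": {"B"}, "A": {"C"}}, 2)
print(plan)
```

Only D and E can run in parallel here; the B-C-A chain serializes the rest, which is why problem structure matters for speedup.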

Preliminary Results
Tested on a small problem, "pedigree1" (w* = 15).

Conclusion
Described an extension of BE that utilizes multiple machines.
Identified and addressed key design issues: decomposition of functions and job scheduling.
Future experiments: test on more problems, specifically ones with structure suitable for parallelization.