Sum-Product Networks
By: Ahmed Badrealldeen
Supervisor: Dr. Yasser Fouad
Introduction
The key limiting factor in graphical-model inference and learning is the complexity of the partition function. We thus ask: what are general conditions under which the partition function is tractable? The answer leads to a new kind of deep architecture, which we call sum-product networks (SPNs). SPNs are directed acyclic graphs with variables as leaves and sums and products as internal nodes. We show that if an SPN is complete and consistent, it represents the partition function and all marginals of some graphical model, and we give semantics to its nodes.
What is a Sum-Product Network?
* Introduced by Poon & Domingos
* Acyclic directed graph of sums and products
* Leaves can be indicator variables or univariate distributions
SPN Architecture
A specific type of deep neural network whose "activation functions" are sums and products.
Advantage: clear semantics and a well-understood theory.
This Talk: Sum-Product Networks
Compactly represent the partition function using a deep network; learn the optimal way to reuse computation.
[Venn diagram: Sum-Product Networks, Graphical Models, Existing Tractable Models]
Outline
Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion
Why Is Inference Hard?
Bottleneck: summing out variables.
E.g., the partition function Z = Σ_x Π_c φ_c(x_c) is a sum of exponentially many products.
Alternative Representation

X1  X2 | P(X)
 1   1 | 0.4
 1   0 | 0.2
 0   1 | 0.1
 0   0 | 0.3

P(X) = 0.4·I[X1=1]·I[X2=1] + 0.2·I[X1=1]·I[X2=0]
     + 0.1·I[X1=0]·I[X2=1] + 0.3·I[X1=0]·I[X2=0]
Shorthand for Indicators
Writing x1 for I[X1=1] and x̄1 for I[X1=0]:
P(X) = 0.4·x1·x2 + 0.2·x1·x̄2 + 0.1·x̄1·x2 + 0.3·x̄1·x̄2
Sum Out Variables
Evidence e: X1 = 1.
P(e) = 0.4·x1·x2 + 0.2·x1·x̄2 + 0.1·x̄1·x2 + 0.3·x̄1·x̄2
with x1 = 1, x̄1 = 0, x2 = 1, x̄2 = 1.
Easy: to sum out a variable, set both of its indicators to 1.
Graphical Representation
The same distribution as a sum-product network: a root sum node with weights 0.4, 0.2, 0.1, 0.3 over product nodes, whose leaves are the indicators x1, x̄1, x2, x̄2.
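The toy network above can be sketched directly in code. This is a minimal illustration (the Leaf/Product/Sum classes are made up for this example, not the authors' implementation); evaluating the root with both indicators of a variable set to 1 sums that variable out.

```python
class Leaf:
    """Indicator I[var = val]; its value comes from the evidence dict."""
    def __init__(self, var, val):
        self.var, self.val = var, val
    def eval(self, ind):
        return ind[(self.var, self.val)]

class Product:
    def __init__(self, children):
        self.children = children
    def eval(self, ind):
        r = 1.0
        for c in self.children:
            r *= c.eval(ind)
        return r

class Sum:
    def __init__(self, weighted):
        self.weighted = weighted  # list of (weight, child) pairs
    def eval(self, ind):
        return sum(w * c.eval(ind) for w, c in self.weighted)

x1, nx1 = Leaf("X1", 1), Leaf("X1", 0)
x2, nx2 = Leaf("X2", 1), Leaf("X2", 0)
root = Sum([(0.4, Product([x1, x2])), (0.2, Product([x1, nx2])),
            (0.1, Product([nx1, x2])), (0.3, Product([nx1, nx2]))])

full = {("X1", 1): 1, ("X1", 0): 0, ("X2", 1): 1, ("X2", 0): 0}
p_full = root.eval(full)             # P(X1=1, X2=1)
marg = {("X1", 1): 1, ("X1", 0): 0, ("X2", 1): 1, ("X2", 0): 1}
p_marg = root.eval(marg)             # P(X1=1): X2 summed out, 0.4 + 0.2
z = root.eval({k: 1 for k in full})  # partition function: all indicators 1
```

Each query is a single bottom-up pass, linear in network size.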
But … Exponentially Large
Example: parity, the uniform distribution over states with an even number of 1's.
The flat representation needs 2^(N-1) products of N indicators each (size N·2^(N-1)).
Can we make this more compact?
Use a Deep Network
Example: parity, the uniform distribution over states with an even number of 1's.
Induce many hidden layers; reuse partial computation.
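The reuse idea can be sketched for parity. Instead of 2^(N-1) explicit products, keep two running values, "prefix has even parity" and "prefix has odd parity", and update them one variable at a time (hypothetical function for this example, with unnormalized weights):

```python
def parity_spn(ind, n):
    """Linear-size sum-product network for the parity distribution.

    ind[(k, v)] is the indicator for X_k = v.  Returns the unnormalized
    count of even-parity states consistent with the evidence; divide by
    2**(n - 1) to get a probability.  Setting both indicators of X_k
    to 1 sums X_k out.
    """
    even, odd = 1.0, 0.0  # the empty prefix has even parity
    for k in range(1, n + 1):
        # One layer of sums over products: extend each parity class by X_k.
        even, odd = (even * ind[(k, 0)] + odd * ind[(k, 1)],
                     even * ind[(k, 1)] + odd * ind[(k, 0)])
    return even
```

Each step adds one layer of sums and products over the previous layer's two values, so the network has O(N) size and O(N) depth instead of a single exponentially wide sum.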
Sum-Product Networks (SPNs)
Rooted DAG.
Nodes: sums, products, and input indicators.
Weights on the edges from each sum node to its children.
[Example network over indicators x1, x̄1, x2, x̄2 with edge weights 0.7, 0.3, 0.4, 0.9, 0.7, 0.2, 0.6, 0.1, 0.3, 0.8]
Valid SPN
An SPN S is valid if S(e) = Σ_{x∼e} S(x) for every evidence e.
Valid ⇒ marginals can be computed efficiently.
The partition function Z is computed by setting all indicators to 1.
Valid SPN: General Conditions
Theorem: an SPN is valid if it is complete and consistent.
Complete: under a sum node, all children cover the same set of variables.
Consistent: under a product node, no variable appears in one child and its negation in another.
If an SPN is incomplete or inconsistent, S(e) ≠ Σ_{x∼e} S(x) in general.
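Both conditions can be checked mechanically. A sketch, where the tuple encoding of nodes is an assumption of this example: a node is ("leaf", var, val), ("prod", [children]), or ("sum", [(weight, child), ...]).

```python
def literals(node):
    """Set of (var, val) indicator literals appearing under node."""
    kind = node[0]
    if kind == "leaf":
        return {(node[1], node[2])}
    kids = node[1] if kind == "prod" else [c for _, c in node[1]]
    out = set()
    for c in kids:
        out |= literals(c)
    return out

def valid(node):
    """True if every sum is complete and every product is consistent."""
    kind = node[0]
    if kind == "leaf":
        return True
    if kind == "sum":
        kids = [c for _, c in node[1]]
        # Completeness: all children must have the same scope.
        scopes = [{v for v, _ in literals(c)} for c in kids]
        if any(s != scopes[0] for s in scopes):
            return False
    else:  # product
        kids = node[1]
        # Consistency: no variable positive in one child, negated in another.
        lits = [literals(c) for c in kids]
        for i, li in enumerate(lits):
            for lj in lits[i + 1:]:
                if any((v, 1 - val) in lj for v, val in li):
                    return False
    return all(valid(c) for c in kids)
```

On the two-variable network from the earlier slides, both conditions hold; a product of I[X1=1] and I[X1=0] fails consistency, and a sum mixing an X1 leaf with an X2 leaf fails completeness.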
Semantics of Sums and Products
Product = feature; products form a feature hierarchy.
Sum = mixture, with a hidden variable summed out: sum node i has weights w_ij, interpreted as P(Y_i = j), and child j is implicitly multiplied by the indicator I[Y_i = j].
Handling Continuous Variables
Sum → integral over the input.
Simplest case: replace each indicator with a univariate Gaussian.
An SPN then compactly defines a very large mixture of Gaussians.
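A small sketch of why the mixture is "very large" (all parameters below are made up for illustration): sum nodes hold per-variable mixtures, and a product node over them multiplies the component counts.

```python
import math

def gauss(x, mu, sigma):
    """Univariate Gaussian density: a leaf of the SPN."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Sum nodes: one small mixture per variable (parameters made up).
def f1(x1):
    return 0.5 * gauss(x1, 0.0, 1.0) + 0.5 * gauss(x1, 3.0, 1.0)

def f2(x2):
    return 0.3 * gauss(x2, -1.0, 0.5) + 0.7 * gauss(x2, 2.0, 1.0)

# Product node over disjoint scopes: a density on (X1, X2).  Expanded,
# this is a mixture of 2 x 2 = 4 bivariate Gaussians, though only
# 2 + 2 components are stored; with d variables and k components each,
# an SPN encodes k**d mixture components from O(d * k) parameters.
def spn_density(x1, x2):
    return f1(x1) * f2(x2)
```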
SPNs Everywhere
Graphical models: existing tractable models (e.g., hierarchical mixture models, thin junction trees) are special cases; SPNs can compactly represent many more distributions.
SPNs Everywhere
Inference methods (e.g., the junction-tree algorithm, message passing, …): SPNs can represent these computations, combine them, and learn the optimal one.
SPNs Everywhere
SPNs can be more compact than graphical models by leveraging determinism, context-specific independence, etc.
SPNs Everywhere
Models for efficient inference (e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams): the SPN is the first such approach learned directly from data.
SPNs Everywhere
A general, probabilistic convolutional network: sum = average-pooling; max = max-pooling.
SPNs Everywhere
Grammars in vision and language (e.g., object-detection grammars, probabilistic context-free grammars): sum = non-terminal; product = production rule.
Outline
Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion
The Challenge
In principle, SPNs can be trained by gradient descent, but the gradient quickly dilutes as it propagates through many layers.
EM has a similar problem.
Hard EM overcomes this.
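The hard-EM idea for a single sum node can be sketched as follows (a hypothetical helper, not the authors' implementation): instead of pushing a diluted gradient through many layers, pick the single best child for each example, count it, and renormalize the counts into weights.

```python
def hard_em_step(weights, child_values, counts):
    """One hard-EM update for a single sum node.

    weights, child_values, counts are parallel lists over the node's
    children; child_values are the children's values on one example.
    Returns (new_weights, counts).
    """
    # Hard E-step: the child with the largest weighted value wins outright.
    best = max(range(len(weights)), key=lambda j: weights[j] * child_values[j])
    counts[best] += 1
    # M-step: weights are the normalized counts.
    total = sum(counts)
    return [c / total for c in counts], counts
```

In practice the counts are usually initialized with a small additive smoothing term so no child's weight collapses to exactly zero.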
Outline
Sum-product networks (SPNs)
Learning SPNs
Experimental results
Conclusion
Task: Image Completion
Very challenging; good for evaluating deep models.
Methodology:
Learn a model from training images.
Complete unseen test images.
Measure mean squared error.
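The evaluation metric is simple enough to state in code (a sketch; representing each image as a flat list of pixel intensities is an assumption of this example):

```python
def mse(original, completed):
    """Mean squared error over the occluded pixels of one image."""
    assert len(original) == len(completed)
    return sum((a - b) ** 2 for a, b in zip(original, completed)) / len(original)
```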
SPN Architecture
[Figure: the image is decomposed hierarchically into regions, recursively down to individual pixels.]
Systems
SPN
DBM [Salakhutdinov & Hinton, 2010]
DBN [Hinton & Salakhutdinov, 2006]
PCA [Turk & Pentland, 1991]
Nearest neighbor [Hays & Efros, 2007]
Preprocessing for DBN
DBN did not work well despite our best efforts.
Followed Hinton & Salakhutdinov [2006]:
Reduced scale: 64×64 → 25×25.
Used a much larger dataset: 120,000 images.
Note: the reduced scale artificially lowers errors.
Caltech: Mean-Square Errors

                   LEFT   BOTTOM
SPN                3551   3270
DBM                9043   9792
DBN                4778   4492
PCA                4234   4465
Nearest Neighbor   4887   5505
SPN vs. DBM / DBN
SPN is an order of magnitude faster, with no elaborate preprocessing or tuning.
Errors reduced by 30–60%.
Learned up to 46 layers.

            SPN          DBM / DBN
Learning    2–3 hours    Days
Inference   < 1 second   Minutes or hours
Scatter Plot: Complete Left
Scatter Plot: Complete Bottom
Olivetti: Mean-Square Errors

                   LEFT   BOTTOM
SPN                 942    918
DBM                1866   2401
DBN                2386   1931
PCA                1076   1265
Nearest Neighbor   1527   1793
End-to-End Comparison
General graphical models: approximate learning + approximate inference.
Sum-product networks: approximate learning + exact inference.
Given the same computation budget, which approach performs better?
Deep Learning
Stack many layers, e.g.:
DBN [Hinton & Salakhutdinov, 2006]
CDBN [Lee et al., 2009]
DBM [Salakhutdinov & Hinton, 2010]
Potentially much more powerful than shallow architectures [Bengio, 2009].
But inference is even harder, and learning requires extensive effort.
Conclusion
Sum-product networks (SPNs): DAGs of sums and products.
Compactly represent the partition function.
Learn many layers of hidden variables.
Exact inference: linear time in network size.
Deep learning: online hard EM.
Substantially outperforms the state of the art on image completion.