Sum-Product Networks: A New Deep Architecture

1 Sum-Product Networks: A New Deep Architecture
Pedro Domingos, Dept. of Computer Science & Engineering, University of Washington. Joint work with Hoifung Poon.

2 Graphical Models: Challenges
Examples: restricted Boltzmann machine (RBM), Bayesian network (e.g., the Sprinkler / Rain / Grass Wet network), Markov network. Advantage: compactly represent probability distributions. Problem: inference is intractable. Problem: learning is difficult.

3 Deep Learning
Stack many layers. E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]. Potentially much more powerful than shallow architectures [Bengio, 2009]. But … inference is even harder, and learning requires extensive effort.

4 Graphical Models
Learning: requires approximate inference. Inference: still approximate.

5 Existing Tractable Models
E.g., hierarchical mixture models, thin junction trees, etc. Problem: too restricted (a small subset of graphical models).

6 This Talk: Sum-Product Networks
Compactly represent the partition function using a deep network. (Diagram: sum-product networks within graphical models, containing the existing tractable models.)

7 Sum-Product Networks
Exact inference: linear time in network size.

8 Sum-Product Networks
Can compactly represent many more distributions than existing tractable models.

9 Sum-Product Networks
Learn the optimal way to reuse computation, etc.

10 Outline
Sum-product networks (SPNs); learning SPNs; experimental results; conclusion.

11 Why Is Inference Hard?
Bottleneck: summing out variables. E.g., the partition function is a sum of exponentially many products.
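For a Markov network with potentials φ_k, writing the partition function out makes the blow-up explicit. This is the standard form (stated here for context, not taken verbatim from the slides):

```latex
Z \;=\; \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N}\;\prod_{k}\phi_k\!\left(x_{\{k\}}\right)
```

The nested sums range over all 2^N joint states of N binary variables, so evaluating Z term by term takes exponential time.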

12 Alternative Representation
X1  X2  P(X)
1   1   0.4
1   0   0.2
0   1   0.1
0   0   0.3
P(X) = 0.4 · I[X1=1] · I[X2=1] + 0.2 · I[X1=1] · I[X2=0] + 0.1 · I[X1=0] · I[X2=1] + 0.3 · I[X1=0] · I[X2=0]

14 Shorthand for Indicators
Write x1 for I[X1=1] and x̄1 for I[X1=0] (and likewise for X2). With the same table, P(X) = 0.4 x1 x2 + 0.2 x1 x̄2 + 0.1 x̄1 x2 + 0.3 x̄1 x̄2.

15 Sum Out Variables
Evidence e: X1 = 1. P(e) = 0.4 x1 x2 + 0.2 x1 x̄2 + 0.1 x̄1 x2 + 0.3 x̄1 x̄2, evaluated with x1 = 1, x̄1 = 0, and x2 = x̄2 = 1. Summing out X2 is easy: just set both of its indicators to 1.
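Plugging these indicator values into the expression above gives the marginal directly; this line just carries out the arithmetic implied by the slide's table:

```latex
P(X_1 = 1) \;=\; 0.4\cdot 1\cdot 1 \;+\; 0.2\cdot 1\cdot 1 \;+\; 0.1\cdot 0\cdot 1 \;+\; 0.3\cdot 0\cdot 1 \;=\; 0.6
```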

16 Graphical Representation
The same table drawn as a network: a root sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes, each multiplying a pair of indicators drawn from x1, x̄1, x2, x̄2.

17 But … Exponentially Large
Example: parity, the uniform distribution over states with an even number of 1's. The flat sum-of-products representation has 2^(N-1) product terms and size on the order of N·2^(N-1). (Figure: products over the indicators x1, x̄1, …, x5, x̄5.)

18 But … Exponentially Large
Can we make this more compact? (Same parity example.)

19 Use a Deep Network
Example: parity, the uniform distribution over states with an even number of 1's. A deep network represents it with size O(N).

20 Use a Deep Network
Same parity example: induce many hidden layers.

21 Use a Deep Network
Same parity example: reuse partial computation across the layers (a sketch of one such construction follows).
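The transcript only keeps the network pictures. One standard O(N) construction (an assumption here, not spelled out in the slides) keeps two nodes per layer, one for "prefix has even parity" and one for "prefix has odd parity"; each layer is a sum of products of the previous layer's nodes with the current variable's indicators. Evaluating that network is the same linear-time dynamic program:

```python
def parity_spn(indicators):
    """Value of an O(N) parity SPN.

    indicators: list of (x_i, xbar_i) pairs; use (1, 0) for X_i = 1,
    (0, 1) for X_i = 0, and (1, 1) to sum X_i out.
    """
    even, odd = 1.0, 0.0                      # the empty prefix has even parity
    for xi, xbari in indicators:
        # Layer update: even/odd parity of the longer prefix, reusing the previous layer.
        even, odd = even * xbari + odd * xi, even * xi + odd * xbari
    return even                               # root: even total parity

print(parity_spn([(1, 0), (1, 0), (1, 1)]))   # X1=1, X2=1, X3 summed out -> 1.0
print(parity_spn([(1, 1)] * 5))               # all indicators 1: 16.0 = 2^(N-1) even states
```

Dividing by the all-indicators-at-1 value normalizes this into the uniform distribution over even-parity states.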

22 Sum-Product Networks (SPNs)
A rooted DAG. Nodes: sums, products, and input indicators. Weights sit on the edges from each sum node to its children. (Figure: a two-variable example with root weights 0.7 and 0.3 over two products of indicator mixtures.)
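As a concrete illustration of this definition, here is a minimal evaluation sketch (the tuple representation is chosen for illustration, not code from the talk): a node is tagged 'leaf', 'prod', or 'sum', and the network value is computed bottom-up, with an unobserved variable's indicators both set to 1.

```python
def spn_value(node, evidence):
    """Evaluate an SPN node given evidence (a dict: variable -> observed value).

    Node formats: ('leaf', var, val), ('prod', [child, ...]),
    ('sum', [(weight, child), ...]).
    """
    kind = node[0]
    if kind == 'leaf':
        _, var, val = node
        # Indicator: 1 if the variable is unobserved (summed out) or matches val.
        return 1.0 if evidence.get(var) is None or evidence[var] == val else 0.0
    if kind == 'prod':
        result = 1.0
        for child in node[1]:
            result *= spn_value(child, evidence)
        return result
    # Sum node: weighted sum of the children's values.
    return sum(w * spn_value(child, evidence) for w, child in node[1])
```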

23 Distribution Defined by SPN
P(X) ∝ S(X), the value the network computes on input X (same example network as the previous slide).

25 Can We Sum Out Variables?
We want P(e) ∝ Σ_{X∈e} S(X). Is S(e), the value obtained by setting the evidence indicators (here e: X1 = 1, so x1 = 1, x̄1 = 0, x2 = x̄2 = 1), equal to that sum?

26 Can We Sum Out Variables?
By definition, P(e) = Σ_{X∈e} P(X); the question is whether the network value S(e) computes the corresponding sum.

28 Can We Sum Out Variables?
Yes, for this network: S(e) = Σ_{X∈e} S(X), so P(e) ∝ S(e).

29 Valid SPN
An SPN is valid if S(e) = Σ_{X∈e} S(X) for every evidence e. Valid ⇒ marginals can be computed efficiently; in particular, the partition function Z is obtained by setting all indicators to 1.

30 Valid SPN: General Conditions
Theorem: an SPN is valid if it is complete and consistent. Complete: under every sum node, the children cover the same set of variables. Consistent: under every product node, no variable appears in one child and its negation in another. (Figure: an incomplete example and an inconsistent example, each with S(e) ≠ Σ_{X∈e} S(X).) A sketch of the two checks follows.
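A small sketch of the completeness and consistency checks on the same tuple representation as the evaluation sketch above, i.e. nodes are ('leaf', var, val), ('prod', children), or ('sum', [(w, child), ...]); binary variables with values 0/1 and the function names are my assumptions, not the authors' code.

```python
def scope(node):
    """Set of variables appearing under a node."""
    if node[0] == 'leaf':
        return {node[1]}
    children = node[1] if node[0] == 'prod' else [c for _, c in node[1]]
    return set().union(*(scope(c) for c in children))

def indicator_leaves(node):
    """Set of (variable, value) indicator leaves under a node."""
    if node[0] == 'leaf':
        return {(node[1], node[2])}
    children = node[1] if node[0] == 'prod' else [c for _, c in node[1]]
    return set().union(*(indicator_leaves(c) for c in children))

def is_valid(node):
    """Sufficient conditions from the talk: every sum complete, every product consistent."""
    if node[0] == 'leaf':
        return True
    if node[0] == 'sum':
        children = [c for _, c in node[1]]
        scopes = [scope(c) for c in children]
        complete = all(s == scopes[0] for s in scopes)   # same variables under each child
        return complete and all(is_valid(c) for c in children)
    # Product: no variable set to 1 in one child and to 0 in a sibling.
    children = node[1]
    child_inds = [indicator_leaves(c) for c in children]
    for i, a in enumerate(child_inds):
        for b in child_inds[i + 1:]:
            if any((var, 1 - val) in b for var, val in a):
                return False
    return all(is_valid(c) for c in children)
```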

31 Semantics of Sums and Products
Product ⇒ feature: products compose features into a feature hierarchy. Sum ⇒ mixture, with a hidden variable summed out: a sum node i with children j and edge weights w_ij corresponds to summing out a hidden variable Y_i, where w_ij plays the role of P(Y_i = j) and child j carries the implicit indicator I[Y_i = j].

32 Inference
Probability: P(X) = S(X) / Z. Example (figure): for X: X1 = 1, X2 = 0, the indicators are x1 = 1, x̄1 = 0, x2 = 0, x̄2 = 1; the two product nodes evaluate to 0.42 and 0.72, and the root gives S(X) = 0.7 × 0.42 + 0.3 × 0.72 = 0.51.

33 Inference
If the weights at each sum node sum to 1, then Z = 1 and P(X) = S(X). In the example the weights are normalized, so S(X) = 0.51 is already the probability.

34 Inference
Marginal: P(e) = S(e) / Z. Example: for e: X1 = 1, set x1 = 1, x̄1 = 0, x2 = x̄2 = 1; the products evaluate to 0.6 and 0.9, and S(e) = 0.7 × 0.6 + 0.3 × 0.9 = 0.69 = 0.51 + 0.18, i.e., S(X1=1, X2=0) + S(X1=1, X2=1). (A small numeric sketch follows.)
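A standalone reconstruction of the slide's two-variable network; the leaf weights are inferred from the node values shown in the figures, so treat them as an assumption rather than the exact network from the talk.

```python
def S(x1, x1b, x2, x2b):
    """Network value of the example SPN; arguments are the four indicator values."""
    a = 0.6 * x1 + 0.4 * x1b    # sum node over X1 feeding the left product
    b = 0.9 * x1 + 0.1 * x1b    # sum node over X1 feeding the right product
    c = 0.3 * x2 + 0.7 * x2b    # sum node over X2 feeding the left product
    d = 0.2 * x2 + 0.8 * x2b    # sum node over X2 feeding the right product
    return 0.7 * (a * c) + 0.3 * (b * d)   # root sum over the two products

print(S(1, 0, 0, 1))   # complete state X1=1, X2=0      -> 0.51
print(S(1, 0, 1, 1))   # evidence X1=1, X2 summed out   -> 0.69
print(S(1, 1, 1, 1))   # all indicators set to 1: Z     -> 1.0
```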

35 Inference
MAP: replace sums with maxes. Example: for e: X1 = 1, the two max-product branches evaluate to 0.42 and 0.72, and the root compares 0.7 × 0.42 = 0.294 against 0.3 × 0.72 = 0.216 and takes the max.

36 Inference
To decode the MAP state, trace back from the root, at each max node picking the child with the highest value. Here the MAP state is X1 = 1, X2 = 0. (A sketch follows.)
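A minimal sketch of this max-and-backtrack pass on the same reconstructed network (weights as assumed above): an upward pass with sums replaced by maxes, then a downward pass that follows each winning child to read off the state.

```python
def map_completion(x1, x1b, x2, x2b):
    """MAP value and state for the example SPN, given indicator values for the evidence."""
    # Upward pass: every sum node takes the max over its weighted children.
    a = max(0.6 * x1, 0.4 * x1b); b = max(0.9 * x1, 0.1 * x1b)
    c = max(0.3 * x2, 0.7 * x2b); d = max(0.2 * x2, 0.8 * x2b)
    left, right = a * c, b * d
    value = max(0.7 * left, 0.3 * right)
    # Downward pass: follow the winning child of each max node to the indicators.
    if 0.7 * left >= 0.3 * right:
        X1 = 1 if 0.6 * x1 >= 0.4 * x1b else 0
        X2 = 1 if 0.3 * x2 >= 0.7 * x2b else 0
    else:
        X1 = 1 if 0.9 * x1 >= 0.1 * x1b else 0
        X2 = 1 if 0.2 * x2 >= 0.8 * x2b else 0
    return value, X1, X2

print(map_completion(1, 0, 1, 1))   # evidence X1=1 -> (0.294, 1, 0), matching the slide
```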

37 Handling Continuous Variables
Sum ⇒ integral over the input. Simplest case: replace each indicator with a Gaussian. An SPN then compactly defines a very large mixture of Gaussians.
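A hedged sketch of what a continuous leaf could look like in the evaluation scheme above: the leaf returns the Gaussian density at an observed value, and summing the variable out corresponds to integrating the density, which is 1. The function and its interface are assumptions, not the talk's code.

```python
import math

def gaussian_leaf(x, mu, sigma):
    """Continuous SPN leaf: density at x, or 1.0 if x is None (variable summed out)."""
    if x is None:
        return 1.0   # integral of the density over the whole real line
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```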

38 SPNs Everywhere
Graphical models: compared with existing tractable models (hierarchical mixture models, thin junction trees, etc.), SPNs can compactly represent many more distributions.

39 SPNs Everywhere
Inference methods (the junction-tree algorithm, message passing, …): SPNs can represent them, combine them, and learn the optimal way to apply them.

40 SPNs Everywhere
Graphical models: SPNs can be more compact by leveraging determinism, context-specific independence, etc.

41 SPNs Everywhere
Models for efficient inference (e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams): SPNs are the first such approach to be learned directly from data.

42 SPNs Everywhere
General, probabilistic convolutional networks: a sum corresponds to average-pooling and a max to max-pooling.

43 SPNs Everywhere
Grammars in vision and language (e.g., object detection grammars, probabilistic context-free grammars): a sum corresponds to a non-terminal and a product to a production rule.

44 Outline
Sum-product networks (SPNs); learning SPNs; experimental results; conclusion.

45 General Approach
Start with a dense SPN. Find the structure by learning the weights: zero weights signify absent connections. Can also learn with EM: each sum node is a mixture over its children.

46 The Challenge
In principle, one can use gradient descent, but the gradient quickly dilutes as it propagates through many layers. Standard (soft) EM has a similar problem. Hard EM overcomes it.

47 Our Learning Algorithm
Online learning with hard EM. Each sum node maintains a count for each child. For each example: find the MAP instantiation with the current weights, increment the count of each chosen child, and renormalize the counts to obtain the new weights. Repeat until convergence. (A sketch follows.)
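A minimal sketch of this update loop on a tree-structured SPN. The class layout, the add-one smoothing, and the tie-breaking are my assumptions, not the authors' implementation.

```python
class Leaf:
    """Indicator leaf: 1 if the variable matches (or is unobserved), else 0."""
    def __init__(self, var, val):
        self.var, self.val = var, val
    def max_value(self, x):
        return 1.0 if x.get(self.var) is None or x[self.var] == self.val else 0.0
    def select(self, x):
        pass                                  # nothing to update at a leaf

class Product:
    def __init__(self, children):
        self.children = children
    def max_value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.max_value(x)
        return v
    def select(self, x):
        for c in self.children:               # a product keeps all of its children
            c.select(x)

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, list(weights)
        self.counts = [1.0] * len(children)   # add-one smoothing (an assumption)
    def max_value(self, x):
        vals = [w * c.max_value(x) for w, c in zip(self.weights, self.children)]
        self.winner = max(range(len(vals)), key=vals.__getitem__)
        return vals[self.winner]
    def select(self, x):
        # Hard-EM step: credit only the winning child, then renormalize to get new weights.
        self.counts[self.winner] += 1.0
        total = sum(self.counts)
        self.weights = [c / total for c in self.counts]
        self.children[self.winner].select(x)

def hard_em_epoch(root, data):
    """One online pass: MAP upward pass, then count updates along the chosen sub-tree."""
    for x in data:                            # x: dict mapping variable name -> value
        root.max_value(x)                     # find MAP instantiation with current weights
        root.select(x)                        # increment counts of the chosen children
```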

48 Outline
Sum-product networks (SPNs); learning SPNs; experimental results; conclusion.

49 Task: Image Completion
Very challenging, and a good test for deep models. Methodology: learn a model from training images, complete unseen test images, and measure the mean squared error.

50 Datasets
Main evaluation: Caltech-101 [Fei-Fei et al., 2004]: 101 categories (e.g., faces, cars, elephants), each with 30–800 images. Also Olivetti [Samaria & Harter, 1994] (400 faces). In each category, the last third is held out for testing; test images show unseen objects.

51 SPN Architecture
(Figure: layers of sum and product nodes built on top of individual pixel variables x.)

53 SPN Architecture
(Figure: sum and product nodes over image regions, down to individual pixel variables x.)

54 Decomposition
(Figure: an image region decomposed into sub-regions.)

55 Decomposition
(Figure, continued: alternative ways of decomposing the same region.)

56 SPN Architecture
(Figure: the full architecture, from the whole image through regions down to individual pixel variables x.) A sketch of the region decompositions follows.
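The transcript keeps only the figure labels, so as a hedged illustration of the region-based architecture described in the talk: every rectangular region gets its own sum nodes, and each way of splitting a region into two sub-regions contributes product nodes pairing the sub-regions' sums. The helper below only enumerates those splits; the name and coordinate convention are assumptions.

```python
def region_splits(x0, y0, x1, y1):
    """All ways to split the rectangle [x0, x1) x [y0, y1) into two sub-rectangles."""
    splits = []
    for x in range(x0 + 1, x1):      # vertical split lines
        splits.append(((x0, y0, x, y1), (x, y0, x1, y1)))
    for y in range(y0 + 1, y1):      # horizontal split lines
        splits.append(((x0, y0, x1, y), (x0, y, x1, y1)))
    return splits

print(region_splits(0, 0, 2, 2))     # a 2x2 region has one vertical and one horizontal split
```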

57 Systems
SPN; DBM [Salakhutdinov & Hinton, 2010]; DBN [Hinton & Salakhutdinov, 2006]; PCA [Turk & Pentland, 1991]; nearest neighbor [Hays & Efros, 2007].

58 Preprocessing for DBN
The DBN did not work well despite our best efforts. We followed Hinton & Salakhutdinov [2006]: reduced the scale from 64 × 64 to 25 × 25 and used a much larger dataset (about 120,000 images). The reduced scale artificially lowers the DBN's errors.

59 Caltech: Mean-Square Errors
                  LEFT   BOTTOM
SPN               3551   3270
DBM               9043   9792
DBN               4778   4492
PCA               4234   4465
Nearest Neighbor  4887   5505

60 SPN vs. DBM / DBN
SPN is an order of magnitude faster, needs no elaborate preprocessing or tuning, reduces errors by 30–60%, and learns up to 46 layers.
            SPN          DBM / DBN
Learning    2-3 hours    Days
Inference   < 1 second   Minutes or hours

61 Scatter Plot: Complete Left

62 Scatter Plot: Complete Bottom

63 Olivetti: Mean-Square Errors
                  LEFT   BOTTOM
SPN                942    918
DBM               1866   2401
DBN               2386   1931
PCA               1076   1265
Nearest Neighbor  1527   1793

64 Example Completions
(Figure: original images alongside completions by SPN, DBM, DBN, PCA, and nearest neighbor.)

65 Open Questions
Other learning algorithms; discriminative learning; architecture; continuous SPNs; sequential domains; other applications.

66 End-to-End Comparison
General graphical models: data → approximate model → approximate inference → performance. Sum-product networks: data → approximate model → exact inference → performance. Given the same computation budget, which approach gives better performance?

67 Conclusion
Sum-product networks (SPNs): DAGs of sums and products that compactly represent the partition function and learn many layers of hidden variables. Exact inference: linear time in network size. Deep learning via online hard EM. Substantially outperform the state of the art on image completion.

