1
CS B351: LEARNING PROBABILISTIC MODELS
2
MOTIVATION
Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net.
Next few lectures: where does the Bayes net come from?
3
[Figure: Bayes net with Strength and Opponent Strength as parents of Win?]
4
[Figure: expanded Bayes net with nodes Win?, Offense strength, Opp. Off. Strength, Defense strength, Opp. Def. Strength, Pass yds, Rush yds, Rush yds allowed, Score allowed]
5
[Figure: further expanded Bayes net adding Strength of schedule, At Home?, Injuries?, Opp injuries?]
7
AGENDA
Learning probability distributions from example data
Influence of structure on performance
Maximum likelihood estimation (MLE)
Bayesian estimation
8
PROBABILISTIC ESTIMATION PROBLEM
Our setting: given a set of examples drawn from the target distribution; each example is complete (fully observable)
Goal: produce some representation of a belief state so we can perform inferences and make predictions
9
DENSITY ESTIMATION
Given dataset D = {d[1], …, d[M]} drawn from underlying distribution P*
Find a distribution that matches P* as "closely" as possible
High-level issues:
Usually there is not enough data to get an accurate picture of P*, which forces us to approximate
Even if we did have P*, how do we define "closeness" (both theoretically and in practice)?
How do we maximize "closeness"?
10
WHAT CLASS OF PROBABILITY MODELS?
For small discrete distributions, just use a tabular representation
Very efficient learning techniques
For large discrete distributions or continuous ones, the choice of probability model is crucial
Increasing complexity =>
Can represent complex distributions more accurately
Need more data to learn well (risk of overfitting)
More expensive to learn and to perform inference
11
TWO LEARNING PROBLEMS
Parameter learning: what entries should be put into the model's probability tables?
Structure learning:
Which variables should be represented / transformed for inclusion in the model?
What direct / indirect relationships between variables should be modeled?
More "high level" problem
Once a structure is chosen, a set of (unestimated) parameters emerges; these need to be estimated using parameter learning
12
LEARNING COIN FLIPS
Cherry and lime candies are in an opaque bag
Observe that c out of N draws are cherries (data)
13
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)
"Intuitive" parameter estimate: empirical distribution P(cherry) = c/N (this will be justified more thoroughly later)
14
STRUCTURE LEARNING EXAMPLE: HISTOGRAM BUCKET SIZES
Histograms are used to estimate distributions over continuous values or large numbers of discrete values… but how fine should the buckets be?
15
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare table P(A,B,C,D) vs. P(A)P(B)P(C)P(D)
Case 1: 15 free parameters (16 entries − sum-to-1 constraint)
P(A,B,C,D) = p1, P(A,B,C,¬D) = p2, …, P(¬A,¬B,¬C,D) = p15, P(¬A,¬B,¬C,¬D) = 1 − p1 − … − p15
Case 2: 4 free parameters
P(A) = p1, P(¬A) = 1 − p1, …, P(D) = p4, P(¬D) = 1 − p4
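The two parameter counts can be checked with a short sketch (function names are illustrative, not from the lecture):

```python
def free_params_joint(n):
    # Full joint table over n binary variables:
    # 2^n entries minus one sum-to-1 constraint.
    return 2 ** n - 1

def free_params_independent(n):
    # Fully factored model P(X1)...P(Xn): one free
    # parameter per binary variable.
    return n

print(free_params_joint(4))        # 15
print(free_params_independent(4))  # 4
```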
16
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare table P(A,B,C,D) vs. P(A)P(B)P(C)P(D)
P(A,B,C,D): would be able to fit ALL relationships in the data
P(A)P(B)P(C)P(D): inherently lacks the capability to accurately model correlations such as A ≈ B
Leads to biased estimates: overestimates or underestimates of the true probabilities
17
[Figure: original joint distribution P(X,Y) vs. the distribution learned under the independence assumption P(X)P(Y)]
18
STRUCTURE LEARNING: EXPRESSIVE POWER
Making more independence assumptions always makes a probabilistic model less expressive
If the independence relationships assumed by structure A are a superset of those in structure B, then B can express any probability distribution that A can
[Figure: three structures over X, Y, Z with successively fewer independence assumptions]
19
[Figure: naive Bayes structure C → F1, F2, …, Fk vs. the reversed structure F1, F2, …, Fk → C. Or…?]
20
ARCS DO NOT NECESSARILY ENCODE CAUSALITY!
[Figure: A → B → C and C → B → A]
Two BNs that can encode the same joint probability distribution
21
READING OFF INDEPENDENCE RELATIONSHIPS
Given B, does the value of A affect the probability of C? I.e., is P(C|B,A) ≠ P(C|B)?
No! C's parent (B) is given, and so C is independent of its non-descendants (A)
Independence is symmetric: C ⊥ A | B => A ⊥ C | B
[Figure: chain A → B → C]
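A quick numeric check of this conditional independence on the chain A → B → C, using made-up CPT values (all numbers here are hypothetical):

```python
# Chain A -> B -> C with illustrative CPTs; verify P(C|B,A) = P(C|B).
pA = {1: 0.3, 0: 0.7}
pB_given_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}  # pB_given_A[a][b]
pC_given_B = {1: {1: 0.6, 0: 0.4}, 0: {1: 0.5, 0: 0.5}}  # pC_given_B[b][c]

def joint(a, b, c):
    # Chain factorization: P(A)P(B|A)P(C|B)
    return pA[a] * pB_given_A[a][b] * pC_given_B[b][c]

def p_c_given_ab(c, a, b):
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

def p_c_given_b(c, b):
    num = sum(joint(a, b, c) for a in (0, 1))
    den = sum(joint(a, b, cc) for a in (0, 1) for cc in (0, 1))
    return num / den

# Once B is given, A provides no further information about C:
print(p_c_given_ab(1, 1, 1), p_c_given_b(1, 1))  # both 0.6
```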
22
LEARNING IN THE FACE OF NOISY DATA
Ex: flip two independent coins
Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
[Figure: Model 1: X and Y independent; Model 2: X → Y]
Parameters estimated via the empirical distribution ("intuitive fit"):
Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Errors in Model 2's estimates are likely to be larger!
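The empirical fits above can be reproduced directly from the counts (a small sketch; the variable names are ours):

```python
# Counts from the slide's dataset of 20 flips of two coins (X, Y).
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
N = sum(counts.values())  # 20

# Model 1: X and Y independent -- fit marginals.
p_x_h = sum(n for (x, _), n in counts.items() if x == "H") / N  # 9/20
p_y_h = sum(n for (_, y), n in counts.items() if y == "H") / N  # 8/20

# Model 2: Y depends on X -- fit a conditional table.
x_h = sum(n for (x, _), n in counts.items() if x == "H")        # 9
p_y_h_given_x_h = counts[("H", "H")] / x_h                      # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (N - x_h)                # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```

Model 2's conditional estimates each rest on fewer datapoints (9 and 11 instead of 20), which is why their errors tend to be larger.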
25
STRUCTURE LEARNING: FIT VS. COMPLEXITY
Must trade off fit of the data vs. complexity of the model
Complex models:
More parameters to learn
More expressive
More data fragmentation = greater sensitivity to noise
Typical approaches explore multiple structures, while optimizing the trade-off between fit and complexity
Need a way of measuring "complexity" (e.g., number of edges, number of parameters) and "fit"
27
FURTHER READING ON STRUCTURE LEARNING
Structure learning with statistical independence testing
Score-based methods (e.g., Bayesian Information Criterion)
Bayesian methods with structure priors
Cross-validated model selection (more on this later)
28
STATISTICAL PARAMETER LEARNING
29
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Let the unknown fraction of cherries be θ (hypothesis)
Probability of drawing a cherry is θ
Assumption: draws are independent and identically distributed (i.i.d.)
30
LEARNING COIN FLIPS
Probability of drawing a cherry is θ
Assumption: draws are independent and identically distributed (i.i.d.)
Probability of drawing 2 cherries is θ²
Probability of drawing 2 limes is (1−θ)²
Probability of drawing 1 cherry and 1 lime: θ(1−θ)
31
LIKELIHOOD FUNCTION
Likelihood of data d = {d1, …, dN} given θ:
P(d|θ) = ∏j P(dj|θ) = θ^c (1−θ)^(N−c)
(i.i.d. assumption; gather the c cherry terms together, then the N−c lime terms)
32
MAXIMUM LIKELIHOOD
Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1−θ)^(N−c)
[Figure: plots of the likelihood function for several datasets]
39
MAXIMUM LIKELIHOOD
Peaks of the likelihood function seem to hover around the fraction of cherries…
Sharpness indicates some notion of certainty…
40
MAXIMUM LIKELIHOOD
P(d|θ) is the likelihood function
The quantity argmaxθ P(d|θ) is known as the maximum likelihood estimate (MLE)
41
MAXIMUM LIKELIHOOD
l(θ) = log P(d|θ)
= log [θ^c (1−θ)^(N−c)]
= log θ^c + log (1−θ)^(N−c)
= c log θ + (N−c) log (1−θ)
44
MAXIMUM LIKELIHOOD
l(θ) = log P(d|θ) = c log θ + (N−c) log (1−θ)
Setting dl/dθ(θ) = 0 gives the maximum likelihood estimate
45
MAXIMUM LIKELIHOOD
dl/dθ(θ) = c/θ − (N−c)/(1−θ)
At the MLE, c/θ − (N−c)/(1−θ) = 0
=> θ = c/N
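The closed-form result θ = c/N can be sanity-checked by brute-force maximization of the log-likelihood (a sketch; function names are ours):

```python
from math import log

def log_likelihood(theta, c, N):
    # l(θ) = c log θ + (N − c) log(1 − θ)
    return c * log(theta) + (N - c) * log(1 - theta)

def mle_grid(c, N, steps=10_000):
    # Grid search over (0, 1); the argmax should land at c/N.
    grid = (i / steps for i in range(1, steps))
    return max(grid, key=lambda t: log_likelihood(t, c, N))

print(mle_grid(3, 10))  # 0.3 = c/N
```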
46
OTHER MLE RESULTS
Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram)
Continuous Gaussian distributions:
Mean = average of the data
Standard deviation = standard deviation of the data
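A minimal sketch of the Gaussian MLE on a made-up sample (the data values are illustrative):

```python
from math import sqrt

def gaussian_mle(xs):
    # ML estimates for a Gaussian: the sample mean and the
    # (1/N, "population") standard deviation of the data.
    n = len(xs)
    mu = sum(xs) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

mu, sigma = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mu, sigma)  # 5.0 2.0
```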
47
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
P(θ|d) ∝ P(d|θ) P(θ)
P(θ|d) is the posterior: the distribution over hypotheses given the data
P(d|θ) is the likelihood
P(θ) is the hypothesis prior
[Figure: plate model θ → d[1], d[2], …, d[M]]
48
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
Assume P(θ) is uniform
P(θ|d) = (1/Z) P(d|θ) = (1/Z) θ^c (1−θ)^(N−c)
What's P(Y|D)?
[Figure: plate model θ → d[1], …, d[M], Y]
50
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
[Figure: plate model θ → d[1], …, d[M], Y]
Can think of this as a "correction" using "virtual counts"
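Under the uniform prior, the prediction P(Y=cherry|d) = ∫ θ P(θ|d) dθ works out to the "virtual count" correction (c+1)/(N+2) (Laplace's rule of succession). A numeric sketch of that integral, with names of our choosing:

```python
def predict_uniform_prior(c, N, steps=100_000):
    # Numerically integrate P(Y=cherry | d) = ∫ θ · P(θ|d) dθ
    # under a uniform prior, where P(θ|d) ∝ θ^c (1-θ)^(N-c).
    num = 0.0  # ∫ θ · θ^c (1-θ)^(N-c) dθ
    den = 0.0  # ∫ θ^c (1-θ)^(N-c) dθ   (normalization Z)
    for i in range(1, steps):
        t = i / steps
        w = t ** c * (1 - t) ** (N - c)
        num += t * w
        den += w
    return num / den

print(predict_uniform_prior(7, 10))  # ≈ (7+1)/(10+2) = 0.6667
```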
51
NONUNIFORM PRIORS
P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1−θ)^(N−c) P(θ)
Define, for all θ, the probability that I believe in θ
[Figure: example prior density P(θ) over [0, 1]]
52
BETA DISTRIBUTION
Beta_α,β(θ) = γ θ^(α−1) (1−θ)^(β−1)
α, β hyperparameters > 0
γ is a normalization constant
α = β = 1 is the uniform distribution
53
POSTERIOR WITH BETA PRIOR
Posterior ∝ θ^c (1−θ)^(N−c) P(θ) = θ^(c+α−1) (1−θ)^(N−c+β−1) = Beta_α+c,β+N−c(θ)
Prediction = mean: E[θ] = (c+α)/(N+α+β)
54
POSTERIOR WITH BETA PRIOR
What does this mean?
Prior specifies a "virtual count" of a = α−1 heads, b = β−1 tails
See heads, increment a; see tails, increment b
Effect of the prior diminishes with more data
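The conjugate update is simple enough to write down directly (a sketch; function names are ours):

```python
def beta_update(alpha, beta, heads, tails):
    # Conjugate update: Beta(α, β) prior + i.i.d. coin-flip data
    # → Beta(α + heads, β + tails) posterior.
    return alpha + heads, beta + tails

def posterior_mean(alpha, beta):
    # Mean of Beta(α, β); with data folded in this is (c+α)/(N+α+β).
    return alpha / (alpha + beta)

# Start from a uniform prior Beta(1, 1); observe 7 cherries, 3 limes.
a, b = beta_update(1, 1, 7, 3)
print(a, b)                  # 8 4
print(posterior_mean(a, b))  # 8/12 ≈ 0.667
```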
55
CHOOSING A PRIOR
Part of the design process; must be chosen according to your intuition
Uninformed belief => α = β = 1 (uniform); strong belief => high α, β
56
EXTENSIONS OF BETA PRIORS
Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
Mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"
57
RECAP
Learning probabilistic models
Parameter vs. structure learning
Single-parameter learning via coin flips
Maximum likelihood
Bayesian learning with a Beta prior
58
MAXIMUM LIKELIHOOD FOR BNs
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
[Figure: BN with Earthquake (E) and Burglar (B) as parents of Alarm (A)]
Counts (N = 1000): E: 500, B: 200 => P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003
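The count-and-normalize recipe can be sketched in a few lines (a hypothetical mini-dataset; nothing here comes from the slide's numbers):

```python
from collections import Counter

def fit_cpt(data, child, parents):
    # ML estimate of P(child | parents) from complete data:
    # count matching (parents, child) assignments and normalize
    # within each parent configuration.
    joint = Counter()  # counts of (parent values, child value)
    marg = Counter()   # counts of parent values alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    return {(pa, x): n / marg[pa] for (pa, x), n in joint.items()}

# Illustrative complete data over Earthquake (E), Burglar (B), Alarm (A).
data = [
    {"E": 1, "B": 1, "A": 1},
    {"E": 1, "B": 0, "A": 1},
    {"E": 1, "B": 0, "A": 0},
    {"E": 0, "B": 0, "A": 0},
]
cpt = fit_cpt(data, "A", ["E", "B"])
print(cpt[((1, 0), 1)])  # P(A=1 | E=1, B=0) = 1/2
```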
59
FITTING CPTs
Each ML entry P(xi | pa_Xi) is given by examining counts of (xi, pa_Xi) in D and normalizing across rows of the CPT
Note that for large k = |Pa_Xi|, very few datapoints will share the values of pa_Xi: on average O(|D|/2^k) for binary parents, and some configurations may be even rarer
Large domains |Val(Xi)| can also be a problem
This is data fragmentation