CSCI 121 Special Topics: Bayesian Networks
Lecture #4: Learning in Bayes Nets
Situation #1: Known Structure, All Variables Observable
[Network: Burglary, Earthquake → Alarm → John calls, Mary calls, shown with a table of observed True/False values for every variable in every training example]
Solution: Build probability tables directly from observations.
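For instance, a counting estimator along these lines builds each conditional probability table as a relative frequency. This is a minimal sketch in Python; the variable names and toy data are my own illustration, not from the lecture.

from itertools import product

# Each record is one fully observed training example: variable -> True/False.
data = [
    {"Burglary": False, "Earthquake": False, "Alarm": False,
     "JohnCalls": False, "MaryCalls": False},
    {"Burglary": True,  "Earthquake": False, "Alarm": True,
     "JohnCalls": True,  "MaryCalls": False},
    # ... more observed examples ...
]

def estimate_cpt(data, child, parents):
    """Estimate P(child = True | parents) as a relative frequency of counts."""
    cpt = {}
    for assignment in product([False, True], repeat=len(parents)):
        rows = [r for r in data
                if all(r[p] == v for p, v in zip(parents, assignment))]
        if rows:  # skip parent combinations that never occur in the data
            cpt[assignment] = sum(r[child] for r in rows) / len(rows)
    return cpt

# e.g. P(Alarm | Burglary, Earthquake) estimated from the observed examples
print(estimate_cpt(data, "Alarm", ["Burglary", "Earthquake"]))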
Situation #2: Known Structure, Some Variables Unobservable
[Same network, but one variable's column in the data (e.g., Alarm) is unknown: "???" instead of observed True/False values]
Solution: Bayesian learning through Maximum Likelihood Estimation
Bayesian Learning
Given: Data D, hypotheses H1, H2, ..., Hn
Want: A prediction for an unknown quantity X
E.g., D = almanac for the past 100 years; Hi = "chance of rain in May is 50%"; X = how much rain tomorrow?
Bayesian Learning
Maximum a posteriori (MAP) hypothesis HMAP: the Hi that maximizes P(Hi | D)
Use Bayes' Rule: P(Hi | D) = P(D | Hi) P(Hi) / P(D)
Bayesian Learning
For a given set of hypotheses, P(D) is fixed, so we are maximizing P(D | Hi) P(Hi).
P(Hi) is a philosophical issue: e.g., Ockham's Razor (the simplest consistent hypothesis is best).
So it all comes down to P(D | Hi): Maximum Likelihood Estimation.
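As a concrete illustration (a small sketch of my own, not from the slides), MAP selection over a discrete hypothesis set just compares log P(D | Hi) + log P(Hi), since P(D) is the same for every hypothesis:

import math

# Each hypothesis says the chance of rain on any given day;
# the data are observed rainy (1) / dry (0) days.
data = [1, 0, 0, 1, 1, 0, 1, 1]                 # 5 rainy days out of 8
hypotheses = {0.25: 0.3, 0.50: 0.4, 0.75: 0.3}  # prior P(Hi) for each Hi

def log_posterior(p_rain, prior):
    # log P(D | Hi) + log P(Hi); the common factor P(D) is ignored
    log_lik = sum(math.log(p_rain if d else 1 - p_rain) for d in data)
    return log_lik + math.log(prior)

h_map = max(hypotheses, key=lambda p: log_posterior(p, hypotheses[p]))
print("MAP hypothesis: chance of rain =", h_map)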
Maximum Likelihood Estimation
For many problems, P(D | Hi) can't be determined analytically (in one step, using algebra). In such cases, iterative gradient methods can be used to explore the space of possibilities.
For example, each Hi might be a set of conditional probability table values (a.k.a. weights).
We can visualize the effect of different weight values via a "hill-climbing" metaphor. Mathematically, this is gradient ascent on the likelihood (equivalently, gradient descent on the negative log-likelihood), and it involves ... calculus!
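A bare-bones illustration of the idea (my own sketch, not the lecture's code): treat one conditional probability value as a weight and nudge it along the gradient of the log-likelihood until it stops improving.

# data: observed True(1)/False(0) values of one node on examples where its
# parents took a fixed setting (the names are hypothetical, for illustration)
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
n_true, n_false = sum(data), len(data) - sum(data)

theta = 0.5          # initial guess for P(node = True | parent setting)
learning_rate = 0.01

for step in range(1000):
    # derivative of the log-likelihood  n_true*log(theta) + n_false*log(1-theta)
    gradient = n_true / theta - n_false / (1 - theta)
    theta += learning_rate * gradient          # climb the likelihood "hill"
    theta = min(max(theta, 1e-6), 1 - 1e-6)    # keep theta a valid probability

print(theta)   # converges to the relative frequency n_true / (n_true + n_false)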
Situation #3: Unknown Structure
[Only Burglary, Earthquake, John calls, and Mary calls are given, each with an observed True/False column; no Alarm node and no links are specified]
Solution: Hidden variables + structure learning
Learning Structure: "Hidden" Variables
Hidden variables: unknown internal factors that we posit to explain the relationship between observed inputs and outputs.
Why not just figure out conditional probabilities directly from observables (Burglary, Earthquake) to observables (John calls, Mary calls)?
Why Hidden Variables?
Hidden variables can yield a more compact model, making learning easier. For example, with a hidden Alarm node the network needs only P(B), P(E), P(A | B,E), P(J | A), and P(M | A), i.e. 10 independent probability values, whereas connecting the observables directly would require larger tables such as P(J | B,E) and P(M | B,E,J), i.e. 14 values.
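The arithmetic behind that comparison can be checked in a few lines (my own illustration; it assumes Boolean nodes, so a node with k parents needs 2**k independent CPT entries):

# Count independent CPT entries for the two structures.
def n_params(parents_per_node):
    return sum(2 ** k for k in parents_per_node.values())

with_hidden = {"Burglary": 0, "Earthquake": 0, "Alarm": 2,
               "JohnCalls": 1, "MaryCalls": 1}
without_hidden = {"Burglary": 0, "Earthquake": 0,
                  "JohnCalls": 2, "MaryCalls": 3}   # M depends on B, E, and J

print(n_params(with_hidden), n_params(without_hidden))   # 10 vs. 14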
Learning Structure
Problem: The number G(N) of possible structures (directed acyclic graphs) on N nodes grows explosively with N:
N = 1: G(N) = 1 (just A)
N = 2: G(N) = 3 (A and B unconnected, A → B, or A ← B)
N = 3: G(N) = 25; N = 4: G(N) = 543; N = 5: G(N) = 29,281; ...
Solution: Local (greedy hill-climbing) or global (Monte Carlo) search algorithms
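A rough sketch of the greedy (hill-climbing) option follows. This is my own simplification rather than the lecture's algorithm: it scores a structure with a BIC-style penalized log-likelihood, considers neighbors that add or delete one edge, and does not check that candidates stay acyclic, which a real implementation must.

import math
from itertools import product, permutations

def log_likelihood(data, structure):
    """structure maps each Boolean node to a tuple of its parents."""
    total = 0.0
    for node, parents in structure.items():
        for assignment in product([False, True], repeat=len(parents)):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(parents, assignment))]
            if not rows:
                continue
            p_true = sum(r[node] for r in rows) / len(rows)
            for r in rows:
                p = p_true if r[node] else 1 - p_true
                if p > 0:
                    total += math.log(p)
    return total

def bic_score(data, structure):
    # penalize model size so extra edges must earn their keep
    n_params = sum(2 ** len(ps) for ps in structure.values())
    return log_likelihood(data, structure) - 0.5 * math.log(len(data)) * n_params

def neighbors(structure):
    # candidate structures differing by adding or deleting a single edge
    # (a full implementation would also reverse edges and reject cycles)
    for child, parent in permutations(structure, 2):
        new = {n: tuple(ps) for n, ps in structure.items()}
        if parent in new[child]:
            new[child] = tuple(p for p in new[child] if p != parent)
        else:
            new[child] = new[child] + (parent,)
        yield new

def hill_climb(data, structure):
    best, best_score = structure, bic_score(data, structure)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            s = bic_score(data, cand)
            if s > best_score:              # take the first improving move
                best, best_score, improved = cand, s, True
                break
    return best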