© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. Learning I Excerpts from Tutorial at:
Bayesian Networks
Qualitative part: statistical independence statements (causality!)
• Directed acyclic graph (DAG)
  - Nodes: random variables of interest (exhaustive and mutually exclusive states)
  - Edges: direct (causal) influence
Quantitative part: local probability models
• Set of conditional probability distributions
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call, with the conditional probability table P(A | E,B) indexed by the values of E and B.]
Learning Bayesian networks (reminder)
[Figure: Data + Prior information are fed to an Inducer, which outputs a network over E, R, B, A, C together with the conditional probability table P(A | E,B).]
The Learning Problem
Learning Problem
[Figure: data over E, B, A and a network whose table P(A | E,B) has unknown entries ("?") are fed to the Inducer, which outputs the network over E, B, A with an estimated table P(A | E,B).]
Learning Parameters for the Burglary Story
[Figure: network over E, B, A, C.]
• i.i.d. samples: the likelihood of the data is a product over the training instances
• Network factorization: the probability of each instance is a product of one local term per variable
• We have 4 independent estimation problems
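The two steps above can be written out as follows. This is a sketch under the assumption that the network is E → A ← B, A → C, as in the burglary example, with m indexing the training instances; the notation L(Θ : D) follows the later slides.

```latex
\begin{align*}
L(\Theta : D)
  &= \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
     && \text{(i.i.d. samples)} \\
  &= \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\,
             P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)
     && \text{(network factorization)} \\
  &= \Bigl[\prod_m P(E[m] : \Theta)\Bigr]
     \Bigl[\prod_m P(B[m] : \Theta)\Bigr]
     \Bigl[\prod_m P(A[m] \mid B[m], E[m] : \Theta)\Bigr]
     \Bigl[\prod_m P(C[m] \mid A[m] : \Theta)\Bigr]
\end{align*}
```

Each bracket involves the parameters of a single family only, which is why the four factors can be maximized separately.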
Incomplete Data
Data is often incomplete
• Some variables of interest are not assigned a value
This phenomenon occurs when we have
• Missing values
• Hidden variables
Missing Values
Examples:
• Survey data
• Medical records
  - Not all patients undergo all possible tests
Missing Values (cont.)
Complicating issue:
• The fact that a value is missing might be indicative of its value
  - The patient did not undergo an X-ray since she complained about fever and not about broken bones...
To learn from incomplete data we need the following assumption:
Missing at Random (MAR): The probability that the value of X_i is missing is independent of its actual value, given the other observed values
Hidden (Latent) Variables
• Attempt to learn a model with variables we never observe
  - In this case, MAR always holds
• Why should we care about unobserved variables?
[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3. With the hidden variable H between the X's and the Y's: 17 parameters. Without H: 59 parameters.]
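The two parameter counts can be reproduced with a short calculation. This is a sketch that assumes all variables are binary, that the network with H has the edges X1, X2, X3 → H → Y1, Y2, Y3, and that the network without H is the fully connected one obtained when H is removed (each Y depends on all X's and on the earlier Y's). These structures are assumptions read off the figure, and `num_parameters` is a hypothetical helper.

```python
def num_parameters(parents, cardinality=2):
    """Count free parameters of a table-CPD network.

    Each node needs (cardinality - 1) free numbers per joint
    assignment of its parents.
    """
    total = 0
    for node, pa in parents.items():
        total += (cardinality - 1) * cardinality ** len(pa)
    return total

# Assumed structure with the hidden variable H.
with_hidden = {
    "X1": [], "X2": [], "X3": [],
    "H": ["X1", "X2", "X3"],
    "Y1": ["H"], "Y2": ["H"], "Y3": ["H"],
}

# Assumed structure once H is removed from the model.
without_hidden = {
    "X1": [], "X2": [], "X3": [],
    "Y1": ["X1", "X2", "X3"],
    "Y2": ["X1", "X2", "X3", "Y1"],
    "Y3": ["X1", "X2", "X3", "Y1", "Y2"],
}

print(num_parameters(with_hidden))     # 17
print(num_parameters(without_hidden))  # 59
```

Under these assumptions the hidden variable acts as a bottleneck: it replaces three large tables with one table for H and three small ones.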
Learning Parameters from Incomplete Data (cont.)
• In the presence of incomplete data, the likelihood can have multiple global maxima
• Example:
  - We can rename the values of the hidden variable H
  - If H has two values, the likelihood has two global maxima
• Similarly, local maxima are also replicated
• With many hidden variables, this becomes a serious problem
[Figure: network over H and Y.]
Gradient Ascent
• Main result: a closed-form expression for the gradient of the log-likelihood with respect to each parameter
  - Requires computing P(x_i, Pa_i | o[m], Θ) for all i, m
• Pros:
  - Flexible
  - Closely related to methods in neural network training
• Cons:
  - Need to project the gradient onto the space of legal parameters
  - To get reasonable convergence we need to combine it with “smart” optimization techniques
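For reference, one standard form of this gradient for networks with table CPDs is sketched below. The symbol θ_{x_i, pa_i} for the parameter P(x_i | pa_i) is an assumption about notation; o[m] denotes the observed part of the m-th instance, as on the slide.

```latex
\[
\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}}
  = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i \mid o[m], \Theta)
\]
```

The sum on the right is the expected count of the event (x_i, pa_i), so each gradient evaluation needs one inference pass per training instance.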
Expectation Maximization (EM)
• A general-purpose method for learning from incomplete data
Intuition:
• If we had access to counts, then we could estimate parameters
• However, missing values do not allow us to perform counts
• “Complete” the counts using the current parameter assignment
[Figure: a data table over X, Y, Z with missing entries ("?") is combined with the current model, e.g. P(Y=H | X=T, Θ) = 0.4 and P(Y=H | X=H, Z=T, Θ) = 0.3, to produce a table of expected counts N(X,Y).]
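The idea of completing counts can be sketched in a few lines. The example below is a simplification that assumes only two variables, X always observed and Y sometimes missing, and a current model given as P(Y=H | X); the data and the function `expected_counts` are hypothetical and not taken from the slide.

```python
from collections import defaultdict

def expected_counts(data, p_y_heads_given_x):
    """Return expected counts N(X, Y) under the current model.

    data: list of (x, y) pairs, where y is "H", "T", or None if missing.
    p_y_heads_given_x: dict mapping x to the current P(Y = "H" | X = x).
    """
    counts = defaultdict(float)
    for x, y in data:
        if y is None:
            # Missing value: split one unit of count between both
            # completions, weighted by the current model.
            p = p_y_heads_given_x[x]
            counts[(x, "H")] += p
            counts[(x, "T")] += 1.0 - p
        else:
            # Observed value: an ordinary count.
            counts[(x, y)] += 1.0
    return dict(counts)

# Hypothetical data and current parameters.
data = [("H", "H"), ("T", None), ("H", "T"), ("T", "T"), ("T", None)]
current_model = {"H": 0.3, "T": 0.4}
counts = expected_counts(data, current_model)
```

The expected counts always sum to the number of instances, so they can be plugged into the same estimation formulas as ordinary counts.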
EM (cont.)
• Start from the training data and an initial network (G, Θ_0) over X1, X2, X3, H, Y1, Y2, Y3
• Computation (E-Step): compute the expected counts N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H)
• Reparameterize (M-Step): obtain the updated network (G, Θ_1)
• Reiterate
EM (cont.)
Formal guarantees:
• L(Θ_1 : D) ≥ L(Θ_0 : D)
  - Each iteration improves the likelihood, or at least does not decrease it
• If Θ_1 = Θ_0, then Θ_0 is a stationary point of L(Θ : D)
  - Usually, this means a local maximum
Main cost:
• Computation of expected counts in the E-Step
• Requires a computation pass for each instance in the training set
  - These are exactly the same computations as for gradient ascent!
Example: EM in clustering
• Consider the clustering example
[Figure: network with Cluster as the parent of X1, X2, ..., Xn.]
E-Step:
• Compute P(C[m] | X_1[m], …, X_n[m], Θ)
  - This corresponds to a “soft” assignment to clusters
  - Compute the expected statistics needed to re-estimate P(X_i | C) and P(C)
M-Step:
• Re-estimate P(X_i | C), P(C)
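The E-Step and M-Step above can be sketched for the simplest case. The code assumes binary features, a hidden cluster variable with k values, and maximum-likelihood updates without smoothing; the function name `em_cluster`, the random initialization, and the toy data are assumptions for illustration.

```python
import math
import random

def em_cluster(data, k, iters=30, seed=0):
    """EM for a cluster model with hidden class C and binary features X_i."""
    rng = random.Random(seed)
    n = len(data[0])
    prior = [1.0 / k] * k  # P(C = c)
    # theta[c][i] = P(X_i = 1 | C = c); random start breaks symmetry.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(n)] for _ in range(k)]
    log_liks = []
    for _ in range(iters):
        # E-step: soft assignment P(C | x) for every instance.
        resp, ll = [], 0.0
        for x in data:
            joint = []
            for c in range(k):
                p = prior[c]
                for i in range(n):
                    p *= theta[c][i] if x[i] else 1.0 - theta[c][i]
                joint.append(p)
            z = sum(joint)
            ll += math.log(z)
            resp.append([p / z for p in joint])
        log_liks.append(ll)
        # M-step: re-estimate P(C) and P(X_i | C) from expected counts.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            if n_c == 0.0:
                continue  # empty cluster: keep its old parameters
            prior[c] = n_c / len(data)
            for i in range(n):
                n_ic = sum(r[c] for r, x in zip(resp, data) if x[i])
                theta[c][i] = n_ic / n_c
    return prior, theta, log_liks

# Hypothetical data: two noisy groups of binary vectors.
data = ([(1, 1, 0, 0)] * 5 + [(1, 1, 1, 0)]
        + [(0, 0, 1, 1)] * 5 + [(0, 1, 1, 1)])
prior, theta, log_liks = em_cluster(data, k=2)
```

The returned `log_liks` list records the log-likelihood at the start of each iteration, which makes it easy to check that the sequence does not decrease and to apply a stopping criterion.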
EM in Practice
Initial parameters:
• Random parameter settings
• “Best” guess from another source
Stopping criteria:
• Small change in the likelihood of the data
• Small change in parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early “pruning” of unpromising ones
Speed up:
• Various methods to speed convergence
Why Struggle for Accurate Structure
[Figure: three networks over Earthquake, Alarm Set, Sound, Burglary: the true network, one with an added arc, and one with a missing arc.]
Adding an arc:
• Increases the number of parameters to be fitted
• Wrong assumptions about causality and domain structure
Missing an arc:
• Cannot be compensated by accurate fitting of parameters
• Also misses causality and domain structure
Minimum Description Length (cont.)
• Computing the description length of the data, we get a sum of three terms:
  - # bits to encode G
  - # bits to encode Θ_G
  - # bits to encode D using (G, Θ_G)
• Minimizing this term is equivalent to maximizing its negation, which serves as the score of G
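A commonly used concrete form of these terms is sketched below. It assumes M training instances, dim(G) free parameters in G, each parameter encoded with (log M)/2 bits, and ℓ(G : D) denoting the maximized log-likelihood of the data given G; these symbols are assumptions and may differ from the original slide.

```latex
\begin{align*}
DL(D : G) &= DL(G) + \frac{\log M}{2}\,\dim(G) - \ell(G : D) \\
MDL(G : D) &= \ell(G : D) - \frac{\log M}{2}\,\dim(G) - DL(G)
\end{align*}
```

The second line is simply the negation of the first, so minimizing the description length and maximizing the MDL score select the same structure.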
Heuristic Search
• We address the problem by using heuristic search
• Define a search space:
  - nodes are possible structures
  - edges denote adjacency of structures
• Traverse this space looking for high-scoring structures
Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...
Heuristic Search (cont.)
• Typical operations:
  - Add C → D
  - Reverse C → E
  - Remove C → E
[Figure: four networks over S, C, E, D showing a structure and the result of each operation.]
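The three operations can be enumerated mechanically, as long as any change that would create a directed cycle is rejected. The sketch below represents a structure as a set of (parent, child) pairs; the starting edges S → C, C → E, E → D are a hypothetical example, since the figure's exact edges are not recoverable, and `is_acyclic` and `neighbors` are illustrative names.

```python
def is_acyclic(nodes, edges):
    """Check for a DAG by repeatedly stripping nodes with no incoming edge."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(v == n and u in remaining for u, v in edges)}
        if not roots:
            return False  # every remaining node has a parent: a cycle exists
        remaining -= roots
    return True

def neighbors(nodes, edges):
    """Yield (operation, new_edges) for every legal one-edge change."""
    edges = frozenset(edges)
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if (u, v) in edges:
                # Removing an edge can never create a cycle.
                yield ("remove", u, v), edges - {(u, v)}
                flipped = (edges - {(u, v)}) | {(v, u)}
                if is_acyclic(nodes, flipped):
                    yield ("reverse", u, v), flipped
            elif (v, u) not in edges:
                added = edges | {(u, v)}
                if is_acyclic(nodes, added):
                    yield ("add", u, v), added

# Hypothetical starting structure over the slide's variables.
nodes = ["S", "C", "E", "D"]
edges = {("S", "C"), ("C", "E"), ("E", "D")}
moves = dict(neighbors(nodes, edges))
```

With this starting structure, `moves` contains the three operations named on the slide, while a change such as adding D → S is excluded because it would close a cycle.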
Exploiting Decomposability in Local Search
• Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move
[Figure: networks over S, C, E, D before and after local changes.]
Greedy Hill-Climbing
Simplest heuristic local search
• Start with a given network
  - empty network
  - best tree
  - a random network
• At each iteration
  - Evaluate all possible changes
  - Apply the change that leads to the best improvement in score
  - Reiterate
• Stop when no modification improves the score
Each step requires evaluating approximately n new changes
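The loop above, combined with cached family scores, can be sketched as follows. The code assumes a decomposable score supplied as a function `family_score(node, parents)`; the toy score used in the example is a hypothetical stand-in that simply rewards matching a fixed target structure, not a real data-driven score such as MDL.

```python
def is_ancestor(parents, a, b):
    """True if a == b or a is an ancestor of b in the parent map."""
    stack, seen = [b], set()
    while stack:
        n = stack.pop()
        if n == a:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return False

def hill_climb(nodes, family_score):
    """Greedy hill-climbing from the empty network with cached family scores."""
    cache = {}

    def fs(node, pa):
        if (node, pa) not in cache:
            cache[(node, pa)] = family_score(node, pa)
        return cache[(node, pa)]

    parents = {n: frozenset() for n in nodes}
    while True:
        best_gain, best_change = 0.0, None
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                candidates = []
                if u in parents[v]:
                    # Remove u -> v.
                    candidates.append({v: parents[v] - {u}})
                    # Reverse u -> v, unless another path u ~> v exists.
                    without = dict(parents)
                    without[v] = parents[v] - {u}
                    if not is_ancestor(without, u, v):
                        candidates.append({v: parents[v] - {u},
                                           u: parents[u] | {v}})
                elif not is_ancestor(parents, v, u):
                    # Add u -> v (legal because v is not an ancestor of u).
                    candidates.append({v: parents[v] | {u}})
                for change in candidates:
                    # Only the families touched by the move are re-scored.
                    gain = sum(fs(n, pa) - fs(n, parents[n])
                               for n, pa in change.items())
                    if gain > best_gain + 1e-12:
                        best_gain, best_change = gain, change
        if best_change is None:
            return parents  # no modification improves the score
        parents.update(best_change)

# Hypothetical target structure and toy score (not a real scoring function).
target = {"S": frozenset(), "C": frozenset({"S"}),
          "E": frozenset({"C"}), "D": frozenset({"E"})}

def toy_family_score(node, pa):
    return -len(pa ^ target[node])

learned = hill_climb(["S", "C", "E", "D"], toy_family_score)
```

Because the score decomposes over families, the gain of a move depends on at most two families, which is what keeps each evaluation cheap.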
Greedy Hill-Climbing (cont.)
• Greedy hill-climbing can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaus: some one-edge changes leave the score unchanged
• Both occur in the search space
Greedy Hill-Climbing (cont.)
To avoid these problems, we can use:
• TABU-search
  - Keep a list of the K most recently visited structures
  - Apply the best move that does not lead to a structure in the list
  - This escapes plateaus and local maxima whose “basin” is smaller than K structures
• Random restarts
  - Once stuck, apply some fixed number of random edge changes and restart the search
  - This can escape from the basin of one maximum to another
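The TABU idea can be illustrated on a toy one-dimensional landscape instead of network structures. In the sketch below the states are integers, the neighbors of x are x-1 and x+1, and the score table is hypothetical; `tabu_search` is an illustrative name.

```python
from collections import deque

def tabu_search(start, neighbors, score, k=5, max_steps=100):
    """Best-admissible-move search with a list of the k most recent states."""
    current = best = start
    tabu = deque([start], maxlen=k)
    for _ in range(max_steps):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        # Take the best non-tabu move even if it goes downhill.
        current = max(candidates, key=score)
        tabu.append(current)
        if score(current) > score(best):
            best = current
    return best

# Hypothetical landscape: local maximum at 2, global maximum at 8.
scores = [0, 1, 3, 2, 1, 2, 4, 6, 9, 5, 0]

def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(scores)]

def toy_score(x):
    return scores[x]

print(tabu_search(2, toy_neighbors, toy_score, k=5))  # 8
print(tabu_search(2, toy_neighbors, toy_score, k=1))  # 2
```

With k=5 the search walks through the dip and reaches the global maximum; with k=1 the list is too short, and the search oscillates around the local maximum.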
Other Local Search Heuristics
• Stochastic first-ascent hill-climbing
  - Evaluate possible changes at random
  - Apply the first one that leads “uphill”
  - Stop after a fixed number of “unsuccessful” attempts to change the current candidate
• Simulated annealing
  - Similar idea, but also apply “downhill” changes with a probability that depends on the change in score
  - Use a temperature to control the amount of random downhill steps
  - Slowly “cool” the temperature to reach a regime where only strict uphill moves are performed
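One common way to realize the simulated annealing step is the Metropolis rule, which accepts a downhill change of size delta with probability exp(delta / T). The sketch below assumes that rule and a geometric cooling schedule; both are assumptions, since the slide does not specify them, and the landscape is a hypothetical toy example.

```python
import math
import random

def accept(delta, temperature, rng):
    """Metropolis rule: always accept uphill, sometimes accept downhill."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def anneal(start, neighbors, score, t0=2.0, cooling=0.95, steps=500, seed=0):
    """Simulated annealing with geometric cooling (assumed schedule)."""
    rng = random.Random(seed)
    current = best = start
    t = t0
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        if accept(score(candidate) - score(current), t, rng):
            current = candidate
            if score(current) > score(best):
                best = current
        # Cool slowly; at low temperature only uphill moves get accepted.
        t = max(t * cooling, 1e-6)
    return best

# Hypothetical landscape: local maximum at 2, global maximum at 8.
scores = [0, 1, 3, 2, 1, 2, 4, 6, 9, 5, 0]

def toy_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(scores)]

def toy_score(x):
    return scores[x]

result = anneal(2, toy_neighbors, toy_score)
```

The returned state is never worse than the start, but whether the global maximum is found depends on the random seed and the cooling schedule.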
Examples
• Predicting heart disease
  - Features: cholesterol, chest pain, angina, age, etc.
  - Class: {present, absent}
• Finding lemons in cars
  - Features: make, brand, miles per gallon, acceleration, etc.
  - Class: {normal, lemon}
• Digit recognition
  - Features: matrix of pixel descriptors
  - Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
• Speech recognition
  - Features: signal characteristics, language model
  - Class: {pause/hesitation, retraction}
Some Applications
• Biostatistics -- Medical Research Council (Bugs)
• Data Analysis -- NASA (AutoClass)
• Collaborative filtering -- Microsoft (MSBN)
• Fraud detection -- ATT
• Classification -- SRI (TAN-BLT)
• Speech recognition -- UC Berkeley