Advances in Bayesian Learning
Learning and Inference in Bayesian Networks
Irina Rish, IBM T.J. Watson Research Center
"Road map"
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- How to learn them: learning parameters, learning graph structure
- Summary
Bayesian Networks
[Figure: example network with nodes Smoking, lung Cancer, Bronchitis, X-ray, Dyspnoea]
Example query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
What are they good for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Classification: P(class | data) = ?
- Decision-making (given a cost function)
Application areas: medicine, bioinformatics, computer troubleshooting, stock market, text classification, speech recognition.
Bayesian Networks: Representation
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Conditional independencies give an efficient representation: one CPD (conditional probability distribution) per node, e.g. a table for P(D | C, B) with a row for each setting of (C, B) and columns for D=0 and D=1.
[Figure: the Smoking/lung Cancer network annotated with its CPDs]
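To make the factorization concrete, here is a minimal sketch that encodes each CPD as a table (all probability values are invented for illustration) and answers the query from the earlier slide by brute-force enumeration; the inference slide below shows how to avoid enumerating the full joint.

```python
# Hypothetical CPTs for the Smoking/Cancer network (numbers are illustrative
# only, not from the tutorial). All variables are binary: 0 = no, 1 = yes.
p_s = [0.8, 0.2]                                   # P(S)
p_c_s = [[0.99, 0.01], [0.90, 0.10]]               # P(C|S): p_c_s[s][c]
p_b_s = [[0.95, 0.05], [0.70, 0.30]]               # P(B|S): p_b_s[s][b]
p_x_cs = [[[0.95, 0.05], [0.80, 0.20]],            # P(X|C,S): p_x_cs[c][s][x]
          [[0.10, 0.90], [0.05, 0.95]]]
p_d_cb = [[[0.90, 0.10], [0.30, 0.70]],            # P(D|C,B): p_d_cb[c][b][d]
          [[0.20, 0.80], [0.05, 0.95]]]

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return p_s[s] * p_c_s[s][c] * p_b_s[s][b] * p_x_cs[c][s][x] * p_d_cb[c][b][d]

# Query from the earlier slide: P(C=1 | S=0, D=1), by summing the joint over
# the unobserved variables B and X and normalizing over C.
num = sum(joint(0, 1, b, x, 1) for b in (0, 1) for x in (0, 1))
den = sum(joint(0, c, b, x, 1) for c in (0, 1) for b in (0, 1) for x in (0, 1))
print("P(cancer=yes | smoking=no, dyspnoea=yes) =", num / den)
```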
Example: Printer Troubleshooting
Bayesian networks: inference
Query: P(X | evidence) = ?
Example: P(s | d=1) ∝ Σ_{x,c,b} P(s) P(c|s) P(b|s) P(x|c,s) P(d=1|c,b)
Variable elimination pushes the sums inside the product:
P(s | d=1) ∝ P(s) Σ_b P(b|s) Σ_c P(c|s) P(d=1|c,b) Σ_x P(x|c,s)
Complexity is exponential in the "induced width" w* (the maximum clique size induced in the "moral" graph under the chosen ordering); here w* = 4.
Efficient inference: good variable orderings, conditioning, approximations.
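A minimal sketch of this elimination for P(s | d=1), reusing the illustrative CPTs from the previous sketch as numpy arrays; the elimination order (X, then C, then B) mirrors the nested sums above:

```python
import numpy as np

# Factor tables (same illustrative numbers as the earlier sketch).
P_S = np.array([0.8, 0.2])                         # index: s
P_C_S = np.array([[0.99, 0.01], [0.90, 0.10]])     # indices: s, c
P_B_S = np.array([[0.95, 0.05], [0.70, 0.30]])     # indices: s, b
P_X_CS = np.array([[[0.95, 0.05], [0.80, 0.20]],
                   [[0.10, 0.90], [0.05, 0.95]]])  # indices: c, s, x
P_D_CB = np.array([[[0.90, 0.10], [0.30, 0.70]],
                   [[0.20, 0.80], [0.05, 0.95]]])  # indices: c, b, d

# Eliminate innermost-out, mirroring
# P(s) * sum_b P(b|s) * sum_c P(c|s) P(d=1|c,b) * sum_x P(x|c,s).
f_x = P_X_CS.sum(axis=2)                  # sum_x P(x|c,s) = 1; shape (c, s)
f_c = np.einsum('sc,cb,cs->sb',
                P_C_S, P_D_CB[:, :, 1], f_x)  # sum over c; shape (s, b)
f_b = np.einsum('sb,sb->s', P_B_S, f_c)   # sum over b; shape (s,)
unnorm = P_S * f_b                        # P(s, d=1)
print("P(S | D=1) =", unnorm / unnorm.sum())
```

Note that the barren node X sums out to 1, so no intermediate factor here ever involves more than three variables, which is the point of choosing a good ordering.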
"Road map"
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- Why and how to learn them: learning parameters, learning graph structure
- Summary
Why learn Bayesian networks?
- Combining domain expert knowledge with data
- Efficient representation and inference
- Incremental learning: update the probabilities or the structure as new data arrives
- Learning causal relationships
- Handling missing data
Learning Bayesian Networks
Known graph (learn parameters P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)):
- Complete data: parameter estimation (ML, MAP)
- Incomplete data: non-linear parametric optimization (gradient descent, EM)
Unknown graph (learn graph and parameters):
- Complete data: optimization (search in the space of graphs)
- Incomplete data: structural EM, mixture models
Learning Parameters: complete data
ML estimate (decomposable!): θ_ijk = N_ijk / N_ij, where N_ijk is the multinomial count of X_i = k with its parents in state j, and N_ij = Σ_k N_ijk.
MAP estimate (Bayesian statistics), using the conjugate Dirichlet prior: θ_ijk = (N_ijk + α_ijk) / (N_ij + α_ij); the prior counts α act as an equivalent sample size encoding prior knowledge.
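Because the estimate decomposes over families, each CPD can be learned independently by counting. A minimal sketch for P(D | C, B) (the records below are invented for illustration; setting alpha = 0 recovers the ML estimate):

```python
from collections import Counter

# Complete-data records as (c, b, d) tuples (invented for illustration).
data = [
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (1, 0, 1),
    (1, 1, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0),
]
alpha = 1.0  # Dirichlet pseudo-count per cell; the equivalent sample size

counts = Counter(data)
for c in (0, 1):
    for b in (0, 1):
        n = [counts[(c, b, d)] + alpha for d in (0, 1)]  # N_ijk + alpha_ijk
        total = sum(n)                                    # N_ij + alpha_ij
        print(f"P(D | C={c}, B={b}) =", [round(x / total, 3) for x in n])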
Learning Parameters: incomplete data
With hidden nodes or missing values the marginal likelihood is non-decomposable, so there is no closed-form counting solution.
EM algorithm: start from initial parameters and iterate the following two steps until convergence:
- Expectation: run inference in the current model to compute expected counts, e.g. P(S | X=0, D=1, C=0, B=1) for a record where S is missing.
- Maximization: update the parameters from the expected counts (ML or MAP).
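A minimal sketch of this loop for the smallest interesting case: one hidden binary node H with two observed binary children X1 and X2. The toy data, initial parameters, and variable names are invented for illustration, not taken from the tutorial.

```python
import numpy as np

# Observed records (x1, x2); H is never observed.
data = np.array([(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0), (0, 0)])

rng = np.random.default_rng(0)
p_h = np.array([0.5, 0.5])            # P(H)
p_x1_h = rng.dirichlet([1, 1], 2)     # P(X1|H): rows indexed by h
p_x2_h = rng.dirichlet([1, 1], 2)     # P(X2|H): rows indexed by h

for it in range(50):
    # E-step: inference P(H | X1, X2) for every record -> expected counts.
    post = p_h * p_x1_h[:, data[:, 0]].T * p_x2_h[:, data[:, 1]].T  # (N, 2)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: ML update of all parameters from the expected counts.
    p_h = post.mean(axis=0)
    for col, tab in ((0, p_x1_h), (1, p_x2_h)):
        for v in (0, 1):
            tab[:, v] = post[data[:, col] == v].sum(axis=0)
        tab /= tab.sum(axis=1, keepdims=True)

print("P(H) =", p_h.round(3))
print("P(X1|H) =", p_x1_h.round(3))
```

In a general network the E-step's inference call is exactly the probabilistic inference problem from the first half of the tutorial, which is why efficient inference matters for learning too.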
Learning graph structure
Finding the best graph is an NP-hard optimization problem, so heuristic search is used:
- Greedy local search: from the current graph, score all single-edge changes (add S->B, delete S->B, reverse S->B), move to the best neighbor, repeat (a skeleton follows below).
- Best-first search, simulated annealing.
Complete data: the score decomposes, so each edge change needs only local computations.
Incomplete data (score non-decomposable): structural EM.
Constraint-based methods: independence relations found in the data impose constraints on the graph.
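A skeleton of the greedy local search loop. The `score` argument is a placeholder for any graph score, e.g. the BIC/MDL score sketched after the next slide; all function names here are my own.

```python
import itertools

def is_acyclic(edges, nodes):
    """Kahn's algorithm: a directed graph is a DAG iff every node can be removed."""
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while queue:
        u = queue.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return removed == len(nodes)

def neighbors(edges, nodes):
    """All graphs one local move away: add, delete, or reverse a single edge."""
    for u, v in itertools.permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}               # delete u -> v
            yield (edges - {(u, v)}) | {(v, u)}  # reverse u -> v
        elif (v, u) not in edges:
            yield edges | {(u, v)}               # add u -> v

def greedy_search(nodes, score):
    """Hill-climb over DAGs until no single-edge move improves the score."""
    current = frozenset()
    while True:
        best = max((g for g in neighbors(current, nodes) if is_acyclic(g, nodes)),
                   key=score, default=None)
        if best is None or score(best) <= score(current):
            return current
        current = best
```

With a decomposable score, a real implementation would re-score only the family whose parent set changed instead of calling `score` on the whole candidate graph; that is the "local computations" point above.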
Scoring functions: Minimum Description Length (MDL)
Learning as data compression: MDL = DL(Data | Model) + DL(Model), i.e. the bits needed to encode the data given the model (its negative log-likelihood) plus the bits needed to encode the model itself (its parameters and structure).
Other scores: MDL = -BIC (Bayesian Information Criterion); the Bayesian score (BDe) is asymptotically equivalent to MDL.
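A minimal sketch of the BIC score for a discrete network over complete data (the function name, data layout, and example records are my own; MDL is just the negative of this score):

```python
import math
from collections import Counter

def bic_score(data, parents, arity):
    """BIC = log-likelihood at the ML parameters - (log N / 2) * #parameters.
    data: list of dicts {var: value}; parents: var -> tuple of parent names;
    arity: var -> number of values the variable can take."""
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        # sum_jk N_ijk * log(N_ijk / N_ij): family log-likelihood at the ML estimate
        score += sum(c * math.log(c / marg[j]) for (j, _), c in joint.items())
        # (arity - 1) free parameters per parent configuration
        n_params = (arity[var] - 1) * math.prod(arity[p] for p in pa)
        score -= 0.5 * math.log(n) * n_params
    return score

# Compare the graph S -> C against the empty graph on toy records.
records = [{"S": 0, "C": 0}, {"S": 1, "C": 1}, {"S": 1, "C": 1}, {"S": 0, "C": 0}]
print(bic_score(records, {"S": (), "C": ("S",)}, {"S": 2, "C": 2}))
print(bic_score(records, {"S": (), "C": ()}, {"S": 2, "C": 2}))
```

The score is a sum of per-family terms, which is the decomposability that greedy structure search exploits.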
Summary
- Bayesian Networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning: parameters (parameter estimation, EM) and structure (optimization with score functions, e.g. MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.