Advances in Bayesian Learning: Learning and Inference in Bayesian Networks
Irina Rish, IBM T.J. Watson Research Center, rish@us.ibm.com
"Road map"
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- How to learn them: learning parameters, learning graph structure
- Summary
Bayesian Networks
[Figure: a network over Smoking, lung Cancer, Bronchitis, X-ray, and Dyspnoea]
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Bayesian Networks: Representation
[Figure: Smoking -> lung Cancer, Smoking -> Bronchitis, {lung Cancer, Smoking} -> X-ray, {lung Cancer, Bronchitis} -> Dyspnoea, with local CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)]

CPD for P(D | C, B):
C  B | D=0  D=1
0  0 | 0.1  0.9
0  1 | 0.7  0.3
1  0 | 0.8  0.2
1  1 | 0.9  0.1

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Conditional independencies -> efficient representation.
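A minimal sketch in Python of this factorization. Only the P(D|C,B) table above comes from the slide; the other CPTs (P(S), P(C|S), P(B|S), P(X|C,S)) are made-up numbers purely for illustration.

```python
P_S = {0: 0.8, 1: 0.2}                           # P(S)        (made up)
P_C_S = {0: {0: 0.95, 1: 0.05},                  # P(C | S)    (made up)
         1: {0: 0.80, 1: 0.20}}
P_B_S = {0: {0: 0.9, 1: 0.1},                    # P(B | S)    (made up)
         1: {0: 0.6, 1: 0.4}}
P_X_CS = {(0, 0): {0: 0.9, 1: 0.1},              # P(X | C, S) (made up)
          (0, 1): {0: 0.8, 1: 0.2},
          (1, 0): {0: 0.2, 1: 0.8},
          (1, 1): {0: 0.1, 1: 0.9}}
P_D_CB = {(0, 0): {0: 0.1, 1: 0.9},              # P(D | C, B), from the CPD table above
          (0, 1): {0: 0.7, 1: 0.3},
          (1, 0): {0: 0.8, 1: 0.2},
          (1, 1): {0: 0.9, 1: 0.1}}

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (P_S[s] * P_C_S[s][c] * P_B_S[s][b]
            * P_X_CS[(c, s)][x] * P_D_CB[(c, b)][d])

# The five local CPTs need 1 + 2 + 2 + 4 + 4 = 13 free numbers instead of
# 2**5 - 1 = 31 for an explicit joint table over five binary variables.
print(joint(s=0, c=1, b=0, x=1, d=0))
```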
What are they good for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Classification: choose the class maximizing P(class | data)
- Decision-making (given a cost function)
Application domains: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.
Example: Printer Troubleshooting
Bayesian networks: inference
Task: P(X | evidence) = ?, for example P(s | d=1).
Variable elimination pushes the sums inside the factorization (working on the "moral" graph):
P(s, d=1) = Σ_b Σ_c Σ_x P(s) P(c|s) P(b|s) P(x|c,s) P(d=1|c,b)
          = P(s) Σ_b P(b|s) Σ_c P(c|s) P(d=1|c,b) Σ_x P(x|c,s)
Variable elimination: w* = 4, the "induced width" (max clique size).
Complexity: time and space O(n · exp(w*)).
Efficient inference: good variable orderings, conditioning, approximations.
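To make the elimination concrete, here is a hedged sketch of P(s | d=1) computed by summing out X, then C, then B. It reuses the CPT dictionaries from the sketch above, so all numbers except P(D|C,B) are still made up.

```python
def p_s_given_d1():
    """P(S | D=1): eliminate X, then C, then B, then normalise over S."""
    unnorm = {}
    for s in (0, 1):
        total = 0.0
        for b in (0, 1):
            inner = 0.0
            for c in (0, 1):
                sum_x = sum(P_X_CS[(c, s)][x] for x in (0, 1))   # = 1, so X drops out
                inner += P_C_S[s][c] * P_D_CB[(c, b)][1] * sum_x
            total += P_B_S[s][b] * inner
        unnorm[s] = P_S[s] * total
    z = unnorm[0] + unnorm[1]
    return {s: v / z for s, v in unnorm.items()}

# The cost is exponential only in the induced width of the elimination
# order, not in the total number of variables.
print(p_s_given_d1())
```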
"Road map"
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- Why and how to learn them: learning parameters, learning graph structure
- Summary
Why learn Bayesian networks?
- Combining domain expert knowledge with data
- Efficient representation and inference
- Incremental learning: update the prior P(H) as new data arrive
- Handling missing data, e.g. the record <1.3 2.8 ?? 0 1>
- Learning causal relationships (e.g., S -> C: does Smoking cause lung Cancer?)
Learning Bayesian Networks
Known graph – learn parameters only (CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)):
- Complete data: parameter estimation (ML, MAP)
- Incomplete data: non-linear parametric optimization (gradient descent, EM)
Unknown graph – learn graph and parameters:
- Complete data: optimization (search in the space of graphs)
- Incomplete data: structural EM, mixture models
Learning Parameters: complete data
ML-estimate: maximize the log-likelihood, which is decomposable over families; with multinomial counts N(·),
θ̂(x | pa(x)) = N(x, pa(x)) / N(pa(x))
MAP-estimate (Bayesian statistics): conjugate Dirichlet priors with hyperparameters α acting as an equivalent sample size (prior knowledge),
θ̂(x | pa(x)) = ( N(x, pa(x)) + α(x, pa(x)) ) / ( N(pa(x)) + α(pa(x)) )
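A small sketch of both estimators for the single family P(D | C, B), assuming complete records in a made-up (s, c, b, x, d) format; the records themselves are invented for illustration.

```python
from collections import Counter

records = [  # each record: (s, c, b, x, d), made-up complete data
    (0, 0, 1, 0, 1), (1, 1, 1, 0, 1), (0, 0, 0, 0, 0),
    (0, 0, 1, 0, 1), (1, 1, 0, 1, 0), (0, 0, 0, 0, 1),
]

counts = Counter(((c, b), d) for (_, c, b, _, d) in records)       # N(d, c, b)
parent_counts = Counter((c, b) for (_, c, b, _, _) in records)     # N(c, b)

def ml_estimate(c, b, d):
    """theta(d | c, b) = N(d, c, b) / N(c, b); undefined when N(c, b) = 0."""
    return counts[((c, b), d)] / parent_counts[(c, b)]

def map_estimate(c, b, d, alpha=1.0):
    """Dirichlet prior: equivalent sample size 2*alpha per parent configuration."""
    return (counts[((c, b), d)] + alpha) / (parent_counts[(c, b)] + 2 * alpha)

print(ml_estimate(0, 1, 1), map_estimate(1, 0, 1))
```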
Learning Parameters: incomplete data
With hidden nodes or missing values the marginal likelihood is non-decomposable.
Data (columns S X D C B, '?' = missing): <? 0 1 0 1>, <1 1 ? 0 1>, <0 0 0 ? ?>, <? ? 0 ? 1>, ...
Expected counts / completed data (same columns): 1 0 1 0 1; 1 1 1 0 1; 0 0 0 0 0; 1 0 0 0 1; ...
EM-algorithm (start from initial parameters, iterate until convergence):
- Expectation: with the current model, fill in missing values by inference, e.g. P(S | X=0, D=1, C=0, B=1), to obtain expected counts.
- Maximization: update the parameters (ML, MAP) from the expected counts.
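A minimal EM sketch on a reduced two-node network S -> D with some S values missing. The data, the reduced network, and the initial parameters are all assumptions chosen to keep the example short; the E-step is simple here because the posterior over the single missing variable is available in closed form.

```python
data = [(1, 1), (None, 1), (0, 0), (None, 0), (1, 1), (None, 1)]   # (s, d); None = missing S

p_s = 0.5                                   # initial guess for P(S=1)
p_d_s = {0: 0.5, 1: 0.5}                    # initial guess for P(D=1 | S=s)

for _ in range(50):                         # iterate E and M steps until (near) convergence
    # E-step: expected counts, filling in P(S=1 | D=d) for missing S values
    n_s = {0: 0.0, 1: 0.0}
    n_d1 = {0: 0.0, 1: 0.0}
    for s, d in data:
        if s is None:
            like1 = p_d_s[1] if d == 1 else 1 - p_d_s[1]
            like0 = p_d_s[0] if d == 1 else 1 - p_d_s[0]
            w1 = p_s * like1 / (p_s * like1 + (1 - p_s) * like0)
        else:
            w1 = float(s)
        for s_val, w in ((1, w1), (0, 1 - w1)):
            n_s[s_val] += w
            n_d1[s_val] += w * d
    # M-step: ML re-estimation from the expected counts
    p_s = n_s[1] / len(data)
    p_d_s = {s_val: n_d1[s_val] / n_s[s_val] for s_val in (0, 1)}

print(p_s, p_d_s)
```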
Learning graph structure
Goal: find the graph maximizing the score, G* = argmax_G score(G, Data). This is an NP-hard optimization, so heuristic search is used (see the sketch below):
- Greedy local search over single-edge moves (e.g., add S->B, delete S->B, reverse S->B)
- Best-first search
- Simulated annealing
Complete data: the score decomposes, so each move requires only local computations.
Incomplete data (score non-decomposable): structural EM.
Constraint-based methods: independence relations found in the data impose constraints on the graph.
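A sketch of greedy local search over DAG structures, under the assumption of a decomposable score; the name family_score(node, parents, data) is a hypothetical placeholder, e.g. the BIC/MDL family score sketched after the next slide.

```python
# A structure is a dict mapping each node to its set of parents.
import itertools

def is_acyclic(parents):
    """DFS over parent links; a back edge means the graph has a cycle."""
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbours(parents):
    """Yield every acyclic structure one add/delete/reverse move away."""
    for u, v in itertools.permutations(parents, 2):
        for change in (("delete", "reverse") if u in parents[v] else ("add",)):
            g = {n: set(ps) for n, ps in parents.items()}
            if change in ("delete", "reverse"):
                g[v].discard(u)
            if change == "reverse":
                g[u].add(v)
            if change == "add":
                g[v].add(u)
            if is_acyclic(g):
                yield g

def greedy_search(start, data, family_score):
    """Move to the first score-improving neighbour until no move improves."""
    current = start
    best = sum(family_score(n, ps, data) for n, ps in current.items())
    improved = True
    while improved:
        improved = False
        for g in neighbours(current):
            s = sum(family_score(n, ps, data) for n, ps in g.items())
            if s > best:
                current, best, improved = g, s, True
                break
    return current, best
```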
Scoring functions: Minimum Description Length (MDL)
Learning as data compression: minimize DL(Model) + DL(Data | Model), the bits needed to describe the network plus the bits needed to describe the data given the network.
MDL = -BIC (Bayesian Information Criterion).
Other scores: Bayesian score (BDe), asymptotically equivalent to MDL.
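A hedged sketch of the corresponding BIC family score (MDL = -BIC) for binary variables. The record format, a list of dicts mapping node names to 0/1 values, is an assumption made for illustration, not anything specified on the slide.

```python
import math
from collections import Counter

def family_score(node, parents, data):
    """BIC contribution of one family: log-likelihood - (log N / 2) * #free params."""
    parents = sorted(parents)
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in data)   # N(x, pa)
    marg = Counter(tuple(r[p] for p in parents) for r in data)               # N(pa)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    n_params = 2 ** len(parents)        # binary node: one free parameter per parent config
    return loglik - 0.5 * math.log(n) * n_params

# Example: plug this score into the greedy search sketched earlier, e.g.
# greedy_search({v: set() for v in "SCBXD"}, data, family_score)
```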
Summary
- Bayesian networks: graphical probabilistic models
- Efficient representation and inference
- Combine expert knowledge with learning from data
- Learning: parameters (parameter estimation, EM) and structure (optimization with scoring functions, e.g. MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.