Slide 1: Advances in Bayesian Learning: Learning and Inference in Bayesian Networks
Irina Rish, IBM T.J. Watson Research Center, rish@us.ibm.com
Slide 2: “Road map”
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- How to learn them: learning parameters, learning graph structure
- Summary
Slide 3: Bayesian Networks
A motivating query over a network of Smoking, lung Cancer, Bronchitis, X-ray, and Dyspnoea:
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Slide 4: Bayesian Networks: Representation
Nodes: Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D). Each node stores a conditional probability distribution (CPD) given its parents, e.g., a table for P(D|C,B) with columns D=0 and D=1 indexed by the values of (C, B).
The joint distribution factorizes along the graph:
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
The graph encodes conditional independencies and yields an efficient representation (a sketch in Python follows below).
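To make the factorization concrete, here is a minimal sketch in Python; the graph and the factorization come from the slide, while every probability value is invented for illustration:

```python
from itertools import product

# CPDs of the slide's network; every probability value is invented.
P_S = {1: 0.2, 0: 0.8}                                      # P(S): prior on Smoking
P_C = {1: {1: 0.10, 0: 0.90}, 0: {1: 0.01, 0: 0.99}}        # P(C|S): Cancer given Smoking
P_B = {1: {1: 0.30, 0: 0.70}, 0: {1: 0.05, 0: 0.95}}        # P(B|S): Bronchitis given Smoking
P_X = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.9, 0: 0.1},
       (0, 1): {1: 0.2, 0: 0.8}, (0, 0): {1: 0.1, 0: 0.9}}  # P(X|C,S), keyed by (c, s)
P_D = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.7, 0: 0.3},
       (0, 1): {1: 0.6, 0: 0.4}, (0, 0): {1: 0.1, 0: 0.9}}  # P(D|C,B), keyed by (c, b)

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return P_S[s] * P_C[s][c] * P_B[s][b] * P_X[(c, s)][x] * P_D[(c, b)][d]

# Sanity check: the factorized joint sums to 1 over all 2**5 assignments.
assert abs(sum(joint(*v) for v in product((0, 1), repeat=5)) - 1.0) < 1e-9
print(joint(1, 1, 0, 1, 1))
```

Storing five small CPDs (1 + 2 + 2 + 4 + 4 = 13 free parameters) instead of one unstructured table over 2^5 states is exactly the “efficient representation” the slide points to.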
Slide 5: What are they good for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Classification: P(class | data)
- Decision-making (given a cost function)
Application areas: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.
Slide 6: Example: Printer Troubleshooting
[Figure: printer-troubleshooting Bayesian network; the diagram is not recoverable from this transcript.]
Slide 7: Bayesian networks: inference
Query: P(X | evidence) = ?, for example P(s | d=1).
Variable elimination pushes sums into the factorization over the “moral” graph:
P(s, d=1) = sum_{c,b,x} P(s) P(c|s) P(b|s) P(x|c,s) P(d=1|c,b)
          = P(s) sum_b P(b|s) sum_c P(c|s) P(d=1|c,b) sum_x P(x|c,s)
Complexity: O(n · exp(w*)), where w* is the “induced width” (max clique size) of the ordering; here w* = 4.
Efficient inference: good variable orderings, conditioning, approximations (a sketch follows below).
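A minimal sketch of this elimination in Python, computing P(s | d=1); it repeats the invented CPDs from the representation sketch so that it runs standalone:

```python
# Invented CPDs (same illustrative numbers as in the representation sketch).
P_S = {1: 0.2, 0: 0.8}
P_C = {1: {1: 0.10, 0: 0.90}, 0: {1: 0.01, 0: 0.99}}        # P(C|S)
P_B = {1: {1: 0.30, 0: 0.70}, 0: {1: 0.05, 0: 0.95}}        # P(B|S)
P_X = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.9, 0: 0.1},
       (0, 1): {1: 0.2, 0: 0.8}, (0, 0): {1: 0.1, 0: 0.9}}  # P(X|C,S)
P_D = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.7, 0: 0.3},
       (0, 1): {1: 0.6, 0: 0.4}, (0, 0): {1: 0.1, 0: 0.9}}  # P(D|C,B)

def p_s_d1(s):
    """P(s, d=1) = P(s) * sum_b P(b|s) * sum_c P(c|s) P(d=1|c,b) * sum_x P(x|c,s)."""
    total = 0.0
    for b in (0, 1):                                      # eliminate B
        for c in (0, 1):                                  # eliminate C
            sum_x = sum(P_X[(c, s)][x] for x in (0, 1))   # eliminate X
            total += P_B[s][b] * P_C[s][c] * P_D[(c, b)][1] * sum_x
    return P_S[s] * total

z = p_s_d1(0) + p_s_d1(1)                # normalizing constant P(d=1)
for s in (0, 1):
    print(f"P(S={s} | D=1) = {p_s_d1(s) / z:.4f}")
```

Note that the inner sum over x equals 1, so the unobserved leaf X drops out of the computation entirely; exploiting such structure is part of what a good elimination ordering buys.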
Slide 8: “Road map”
- What are Bayesian networks and why use them?
- How to use them: probabilistic inference
- Why and how to learn them: learning parameters, learning graph structure
- Summary
Slide 9: Why learn Bayesian networks?
- Combining domain expert knowledge with data
- Efficient representation and inference
- Incremental learning: update the prior P(H) as new data arrive
- Handling missing data: records may contain unobserved entries (marked “?”)
- Learning causal relationships (e.g., whether S causes C)
Slide 10: Learning Bayesian Networks
- Known graph – learn the parameters P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B):
  - Complete data: parameter estimation (ML, MAP)
  - Incomplete data: non-linear parametric optimization (gradient descent, EM)
- Unknown graph – learn graph and parameters:
  - Complete data: optimization (search in the space of graphs)
  - Incomplete data: structural EM, mixture models
Slide 11: Learning Parameters: complete data
ML-estimate: maximize the likelihood P(Data | Θ), which is decomposable: it splits into one term per node family. With multinomial counts N(x, pa(x)),
  θ_{x|pa(x)} = N(x, pa(x)) / N(pa(x))
MAP-estimate (Bayesian statistics): use conjugate priors – Dirichlet with pseudo-counts α(x, pa(x)), giving
  θ_{x|pa(x)} = (N(x, pa(x)) + α(x, pa(x))) / (N(pa(x)) + α(pa(x)))
The α’s play the role of an equivalent sample size (prior knowledge). A sketch follows below.
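A minimal sketch of both estimates for a single family, P(D | C, B), from complete data; the records are invented, and the MAP line uses the common smoothed (Dirichlet posterior-mean) form:

```python
from collections import Counter

# Complete-data records as (c, b, d) triples; values invented for illustration.
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (0, 1, 0),
        (0, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]

n_dcb = Counter((d, c, b) for c, b, d in data)   # multinomial counts N(d, c, b)
n_cb = Counter((c, b) for c, b, _ in data)       # parent-configuration counts N(c, b)

def ml(d, c, b):
    """ML estimate: theta_{d|c,b} = N(d, c, b) / N(c, b)."""
    return n_dcb[(d, c, b)] / n_cb[(c, b)]

def map_est(d, c, b, alpha=1.0):
    """Dirichlet-smoothed (posterior-mean) estimate; the pseudo-counts act as
    an equivalent sample size of 2*alpha per parent configuration (D is binary)."""
    return (n_dcb[(d, c, b)] + alpha) / (n_cb[(c, b)] + 2 * alpha)

print(ml(1, 1, 1), map_est(1, 1, 1))   # -> 1.0 0.75
```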
Slide 12: Learning Parameters: incomplete data
With hidden nodes or missing entries the marginal likelihood is non-decomposable. Data records over (S, X, D, C, B) may contain unobserved values, e.g., <1 1 ? 0 1> or <? ? 0 ? 1>.
EM-algorithm (iterate until convergence, starting from initial parameters):
- Expectation: under the current model, run inference (e.g., P(S | X=0, D=1, C=0, B=1)) to compute expected counts for the unobserved values.
- Maximization: update the parameters from the expected counts (ML, MAP).
A sketch follows below.
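A minimal sketch of the EM loop on a deliberately tiny two-node network S -> D, so the E-step inference is a one-liner instead of a full network query; the records (None marks a missing value) and the initial parameters are invented:

```python
# EM for a two-node network S -> D; records are (s, d) pairs and s may be
# missing (None). All records and starting values are invented.
data = [(1, 1), (None, 1), (0, 0), (None, 0), (1, 1), (None, 1), (0, 1)]

p_s = 0.5                                # initial parameter P(S=1)
p_d = {1: 0.5, 0: 0.5}                   # initial parameters P(D=1 | S=s)

for _ in range(50):                      # iterate until convergence
    # E-step: expected counts; a missing S is filled in fractionally by
    # inference P(S | d) under the current model.
    n_s = {0: 0.0, 1: 0.0}               # expected N(S=s)
    n_ds = {0: 0.0, 1: 0.0}              # expected N(D=1, S=s)
    for s, d in data:
        if s is None:
            lik1 = p_s * (p_d[1] if d else 1 - p_d[1])
            lik0 = (1 - p_s) * (p_d[0] if d else 1 - p_d[0])
            w1 = lik1 / (lik1 + lik0)    # P(S=1 | D=d)
        else:
            w1 = float(s)
        for sv, w in ((1, w1), (0, 1.0 - w1)):
            n_s[sv] += w
            if d == 1:
                n_ds[sv] += w
    # M-step: ML update of the parameters from the expected counts.
    p_s = n_s[1] / len(data)
    p_d = {sv: n_ds[sv] / n_s[sv] for sv in (0, 1)}

print(p_s, p_d)
```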
Slide 13: Learning graph structure
Finding the highest-scoring graph is an NP-hard optimization, hence heuristic search:
- Greedy local search over single-edge moves: add S->B, delete S->B, reverse S->B (a sketch follows below)
- Best-first search
- Simulated annealing
Complete data – the score decomposes, so each move needs only local computations. Incomplete data (score non-decomposable) – Structural EM.
Constraint-based methods: independence relations found in the data impose constraints on the graph.
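A minimal sketch of greedy local search in Python over three binary variables, scored with the decomposable BIC score (= -MDL, see the next slide); the data set and the empty-graph starting point are invented for illustration:

```python
import math
from itertools import product
from collections import Counter

NAMES = ("S", "C", "B")
DATA = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0),   # invented complete data
        (0, 0, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)]

def family_score(child, parents):
    """Decomposable BIC family score: max log-likelihood of the family
    minus (log N / 2) * number of free parameters."""
    n = len(DATA)
    n_pa = Counter(tuple(r[p] for p in parents) for r in DATA)
    n_cpa = Counter((r[child],) + tuple(r[p] for p in parents) for r in DATA)
    loglik = sum(c * math.log(c / n_pa[key[1:]]) for key, c in n_cpa.items())
    n_free = 2 ** len(parents)           # binary child: 1 free param per parent config
    return loglik - 0.5 * math.log(n) * n_free

def total_score(graph):
    """graph[i] = set of parents of node i; the score sums over families."""
    return sum(family_score(i, sorted(graph[i])) for i in graph)

def is_acyclic(graph):
    state = {}                           # absent=new, 1=in progress, 2=done
    def visit(v):
        if state.get(v) == 1:
            return False                 # back edge: found a cycle
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in graph[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in graph)

def neighbors(graph):
    """All graphs one move away: add, delete, or reverse a single edge."""
    for i, j in product(graph, repeat=2):
        if i == j:
            continue
        g = {v: set(ps) for v, ps in graph.items()}
        if i in g[j]:
            g[j].discard(i); yield g                     # delete i -> j
            g2 = {v: set(ps) for v, ps in g.items()}
            g2[i].add(j); yield g2                       # reverse i -> j
        else:
            g[j].add(i); yield g                         # add i -> j

graph = {i: set() for i in range(len(NAMES))}            # start from the empty graph
best = total_score(graph)
improved = True
while improved:                                          # greedy hill climbing
    improved = False
    for g in neighbors(graph):
        if is_acyclic(g) and total_score(g) > best + 1e-9:
            graph, best, improved = g, total_score(g), True
            break

print({NAMES[i]: sorted(NAMES[p] for p in ps) for i, ps in graph.items()})
```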
Slide 14: Scoring functions: Minimum Description Length (MDL)
Learning as data compression: the score is the total description length,
  MDL(Model | Data) = DL(Data | Model) + DL(Model),
the bits needed to encode the data given the model plus the bits needed to encode the model itself. MDL = -BIC (Bayesian Information Criterion).
Other scores: the Bayesian score (BDe) is asymptotically equivalent to MDL. A sketch follows below.
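A minimal sketch of the two description-length terms and the MDL = -BIC identity; the sample size, parameter count, and log-likelihood below are invented numbers:

```python
import math

# Invented numbers: N records, k free parameters, log-likelihood at the ML fit.
N, k, loglik = 1000, 13, -2750.0

dl_data = -loglik                        # DL(Data | Model): cost of encoding the data
dl_model = 0.5 * math.log(N) * k         # DL(Model): about (log N)/2 nats per parameter
mdl = dl_data + dl_model                 # the score to minimize

bic = loglik - 0.5 * math.log(N) * k     # Bayesian Information Criterion
assert abs(mdl + bic) < 1e-9             # MDL = -BIC, as the slide states
print(f"MDL = {mdl:.1f}, BIC = {bic:.1f}")
```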
Slide 15: Summary
- Bayesian networks are graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning: parameters (parameter estimation, EM) and structure (optimization with score functions, e.g., MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.