MML, inverse learning and medical data-sets Pritika Sanghi Supervisors: A./Prof. D. L. Dowe Dr P. E. Tischer.


2 Overview What is this project about? Bayesian Networks and their limitations. Some techniques: Factor Analysis; Minimum Message Length (MML); Decision Trees & Graphs; Logistic Regression. Improving Bayesian Networks. What is being done in this project?

3 What is this project about? The aim of the project is to enhance Bayesian Networks in general and then apply them to certain medical data-sets. These data-sets have a large number of attributes and a small number of cases, which makes them difficult to model using Bayesian Networks.

4 Bayesian Networks A popular tool for Data Mining. They model data to infer the probability of a certain outcome. They represent the frequency distributions for the values that an attribute can take as Conditional Probability Distributions. [Figure: example network with nodes WS, GO, S and A. P(WS) = 0.75, P(GO) = 0.50; a table for P(S | WS, GO) over the T/F values of WS and GO with entries 0.01, 0.80, 0.40, 0.99; a table for P(A | S) over the T/F values of S with entries 0.95, 0.00.]
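
As a rough illustration of how such a network is used for inference, the sketch below marginalises over the two parents of S to obtain P(S). It uses the slide's numbers, but the assignment of the four conditional values to parent configurations is an assumption, since the transcript does not preserve the table layout; the helper name `prob` is also hypothetical.

```python
from itertools import product

# Priors taken from the slide.
p_ws = 0.75
p_go = 0.50

# P(S = True | WS, GO). ASSUMPTION: the four values are read in the row
# order (T,T), (T,F), (F,T), (F,F); the transcript does not confirm this.
p_s_given = {
    (True, True): 0.01,
    (True, False): 0.80,
    (False, True): 0.40,
    (False, False): 0.99,
}


def prob(value, p_true):
    """Probability that a binary variable takes `value`."""
    return p_true if value else 1.0 - p_true


# Marginalise over both parents:
# P(S) = sum over ws, go of P(ws) * P(go) * P(S | ws, go)
p_s = sum(
    prob(ws, p_ws) * prob(go, p_go) * p_s_given[(ws, go)]
    for ws, go in product([True, False], repeat=2)
)
print(f"P(S = True) = {p_s:.4f}")
```

Under a different row ordering the result would change, so the number printed here should be read only as an example of the calculation.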

5 Bayesian Networks - Limitations When a child node depends on a large number of parent attributes, the conditional probability distribution (CPD) becomes very complex: there are 2^n rows in the CPD for n binary parent attributes. This makes both creating the CPD and inferring anything from it very time consuming. A more compact representation for CPDs is required.
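
A small sketch of the growth described above, counting the rows of a full CPD table as the number of parents increases. The function name `cpd_rows` and the `arity` parameter are illustrative choices, not part of the project.

```python
def cpd_rows(n_parents, arity=2):
    """Number of parent configurations (rows) in a full CPD table."""
    return arity ** n_parents


# Each extra binary parent doubles the table.
for n in (2, 5, 10, 20):
    print(f"{n:2d} binary parents -> {cpd_rows(n):,} rows")
```

With only a small number of cases, most of these rows would have no supporting data at all, which is one way to see why a more compact representation is needed.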

6 Factor Analysis Multiple attributes may be defined by a common factor. The Wallace and Freeman model for Single Factor Analysis will be implemented. This serves as dimensionality reduction. The validity of the program built will be checked using the data-sets specified in the Wallace and Freeman paper. Attributes A and B have a common factor F1. Attributes C, D and E have a common factor F2.

7 Factor Analysis

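8 The equation for Single Factor Analysis as defined by Wallace and Freeman is: $x_{nk} = \mu_k + a_k v_n + \sigma_k r_{nk}$, where $x_{nk}$ is the data (the value of attribute k in record n), $\mu_k$ is the mean, $a_k$ is the attribute-related term, $v_n$ is the record-related term, $\sigma_k$ is the standard deviation and $r_{nk}$ are random variates from N(0,1). Example (Size / Height / Weight): Large / Tall / Average; Large / Short / Heavy; Medium / Average; Small / Short / Light.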
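
To make the roles of the terms concrete, here is a minimal sketch that generates data from the single factor equation. All parameter values are made up for illustration, and drawing the record-related term from N(0,1) is an assumption of this sketch; the function `simulate_single_factor` is hypothetical and is not the estimator from the Wallace and Freeman paper.

```python
import random


def simulate_single_factor(mu, a, sigma, n_records, seed=0):
    """Generate records x[n][k] = mu[k] + a[k] * v[n] + sigma[k] * r[n][k]."""
    rng = random.Random(seed)
    data, scores = [], []
    for _ in range(n_records):
        v = rng.gauss(0.0, 1.0)  # record-related term (ASSUMED N(0,1))
        row = [
            mu[k] + a[k] * v + sigma[k] * rng.gauss(0.0, 1.0)  # r ~ N(0,1)
            for k in range(len(mu))
        ]
        data.append(row)
        scores.append(v)
    return data, scores


# Hypothetical parameters for two attributes, e.g. height and weight.
mu = [170.0, 70.0]     # means
a = [8.0, 10.0]        # attribute-related terms
sigma = [3.0, 5.0]     # standard deviations of the residual noise

data, scores = simulate_single_factor(mu, a, sigma, n_records=5)
for v, row in zip(scores, data):
    print(f"v = {v:+.2f}  x = {[round(x, 1) for x in row]}")
```

Records with a large positive v tend to be high on every attribute with a positive a_k, which is the sense in which one factor summarises several attributes.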

9 The Minimum Message Length (MML) Principle Models the data as a two-part message: the first part states a hypothesis H and the second part encodes the data D given that hypothesis. The best model is the one with the minimum total message length. Because the length of a message is the negative log of its probability, the total length is -log Pr(H) - log Pr(D|H) = -log Pr(H & D), so minimising it is equivalent to maximising the posterior probability Pr(H|D). The message is represented as: [Hypothesis | Data given Hypothesis]
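
The two-part message can be sketched numerically. The example below compares two candidate hypotheses about a coin and picks the one with the shorter total message. The hypotheses, priors and data are invented for illustration, and the sketch uses a small discrete set of hypotheses rather than the treatment of continuous parameters found in full MML.

```python
import math


def message_length_bits(prior, likelihood):
    """Two-part length: -log2 Pr(H) for the hypothesis, -log2 Pr(D|H) for the data."""
    return -math.log2(prior) - math.log2(likelihood)


def bernoulli_likelihood(p_heads, heads, tails):
    """Probability of one particular sequence with the given counts."""
    return p_heads ** heads * (1.0 - p_heads) ** tails


# Hypothetical data: 8 heads and 2 tails.
heads, tails = 8, 2

# Hypothetical hypotheses: name -> (prior probability, P(heads)).
hypotheses = {
    "fair (p=0.5)": (0.5, 0.5),
    "biased (p=0.8)": (0.5, 0.8),
}

lengths = {
    name: message_length_bits(prior, bernoulli_likelihood(p, heads, tails))
    for name, (prior, p) in hypotheses.items()
}
best = min(lengths, key=lengths.get)

for name, bits in lengths.items():
    print(f"{name}: {bits:.2f} bits")
print("shortest message:", best)
```

Lowering the prior of a hypothesis lengthens its first part, so a more complex or less plausible hypothesis has to compress the data noticeably better to win.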

10 Decision Trees and Graphs Graphical way of representing the output attribute in terms of the input attributes. Used to model the Conditional Probability Distribution of the Bayesian Network. Graphs are generalisations of decision trees. They merge similar sub-trees.
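
One way to picture a decision tree standing in for a CPD is sketched below. The tree, its attribute names and its probabilities are hypothetical; the point is only that the tree needs fewer leaves than a full table has rows when some parents are irrelevant in some contexts.

```python
# Internal node: (attribute_name, subtree_if_true, subtree_if_false).
# Leaf: a float giving P(child = True) for that context.
# Hypothetical CPD over three binary parents A, B and C.
tree = ("A", 0.9, ("B", 0.6, 0.1))


def lookup(node, assignment):
    """Follow the tree down to the leaf matching the parent assignment."""
    while not isinstance(node, float):
        attribute, if_true, if_false = node
        node = if_true if assignment[attribute] else if_false
    return node


def count_leaves(node):
    """Number of distinct probabilities the tree has to store."""
    if isinstance(node, float):
        return 1
    _, if_true, if_false = node
    return count_leaves(if_true) + count_leaves(if_false)


print(lookup(tree, {"A": False, "B": True, "C": False}))
print(count_leaves(tree), "leaves versus", 2 ** 3, "rows in the full table")
```

A decision graph would go one step further and let identical sub-trees share a single node.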

11 Logistic Regression Mathematical modelling approach used for describing the dependence of a variable on other attributes. Will be used to define the probability of a discrete target attribute as a function of continuous attributes. $f(z) = 1 / (1 + e^{-z}) + c$
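
A minimal sketch of how a logistic function turns continuous attributes into a probability for a binary target. The weights, bias and attribute values are invented, `predict_probability` is a hypothetical helper, and the sketch uses the standard logistic function without an additive constant.

```python
import math


def logistic(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, x):
    """P(target = True | x), with z a weighted sum of the continuous attributes."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)


# Hypothetical model with two continuous attributes.
weights = [0.8, -0.5]
bias = -1.0

for x in ([0.0, 0.0], [2.0, 1.0], [4.0, 0.5]):
    print(x, round(predict_probability(weights, bias, x), 3))
```

The model stores one weight per attribute plus a bias, so its size grows linearly with the number of parents rather than exponentially.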

12 Improving Bayesian Networks Comley and Dowe (2003, 2004), building on ideas from Dowe and Wallace (1998), began the work of enhancing Bayesian Networks and introduced Generalised Bayesian Networks. This project will extend their work by applying some of the techniques described earlier to Bayesian Networks.

13 What is being done in this project? Refinement of Generalised Bayesian Networks. Specifically, MML Single Factor Analysis will first be added to Bayesian Networks. Then Logistic Regression will be looked into. The Generalised Bayesian Networks will then be used to infer models from some medical data-sets, such as breast cancer data-sets. If time permits, which it almost definitely won’t, other methods of dimensionality reduction and/or decision graphs will be pursued.

14 References
J. W. Comley and D. L. Dowe: General Bayesian Networks and Asymmetric Languages, Proceedings of the 2003 Hawaii International Conference on Statistics and Related Fields (HICS 2003), Honolulu, Hawaii, USA, 5-8 June 2003, ISSN: 1539-7211, pp 1-18.
J. W. Comley and D. L. Dowe: Minimum Message Length and Generalised Bayesian Nets with Asymmetric Languages, in P. D. Grunwald, I. J. Myung and M. A. Pitt (eds), Advances in Minimum Description Length: Theory and Applications, MIT Press. To be published 2004.
D. L. Dowe and C. S. Wallace: Kolmogorov complexity, minimum message length and inverse learning, in W. Robb (ed), Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), Queensland, Australia, 6-10 July 1998, p 144.
C. S. Wallace and P. R. Freeman: Single factor analysis by MML estimation, J. Royal Stat. Soc. B, 54(1), 195-209, 1992.

15 More Information http://www.monash.edu.au/~sanghi sanghi@mail.csse.monash.edu.au

16 Thank You Any questions?
