MML, inverse learning and medical data-sets
Pritika Sanghi
Supervisors: A/Prof. D. L. Dowe and Dr P. E. Tischer
Overview
- What is this project about?
- Bayesian Networks
- Generalised Bayesian Networks
- Some tools: Factor Analysis, Logistic Regression, Projections
What is this project about?
- Learn properties of medical data-sets that have high dimensionality
- The aim of the project is to estimate complex conditional probability distributions
Bayesian Networks
- A popular tool for data mining
- Model the data to infer the probability of a certain outcome
- The frequency distributions for the values an attribute can take are represented as a Conditional Probability Distribution (CPD), stored in Conditional Probability Tables (CPTs)
[Example CPTs: P(WS) = 0.75, P(GO) = 0.5, a table for P(S | WS, GO) and a table for P(A | S)]
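As a rough sketch of how CPTs are used, the example below stores each table as a Python dictionary and multiplies the relevant entries to obtain a joint probability. It reuses the slide's variable names WS, GO, S and A and the two priors shown, but the entries of the P(S | WS, GO) and P(A | S) tables are made-up assumptions, since they are not fully recoverable from the slide.

```python
# Sketch only: CPTs as dictionaries. The structure is WS -> S <- GO, S -> A.
# P(WS) and P(GO) are taken from the slide; the remaining numbers are
# illustrative assumptions.
p_ws = {True: 0.75, False: 0.25}   # P(WS)
p_go = {True: 0.5, False: 0.5}     # P(GO)

# CPT for S given its two parents, keyed by (WS, GO) -- assumed values
p_s_given = {
    (True, True): 0.9, (True, False): 0.6,
    (False, True): 0.4, (False, False): 0.05,
}

# CPT for A given its single parent S -- assumed values
p_a_given = {True: 0.95, False: 0.1}

def joint(ws: bool, go: bool, s: bool, a: bool) -> float:
    """P(WS, GO, S, A) = P(WS) * P(GO) * P(S | WS, GO) * P(A | S)."""
    p_s = p_s_given[(ws, go)] if s else 1.0 - p_s_given[(ws, go)]
    p_a = p_a_given[s] if a else 1.0 - p_a_given[s]
    return p_ws[ws] * p_go[go] * p_s * p_a

print(joint(True, False, True, True))
```

Inference in a Bayesian network then amounts to summing such joint terms over the variables that are not observed.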
Bayesian Networks - Limitations
- When a child node depends on a large number of parent attributes, the CPD becomes very complex
- When the data has high dimensionality, the CPD will be complex
- Large amounts of data would be required to construct the CPD, as there are many cases (rows in the CPT); this is not always available
- There will be cases in the CPT which are not seen in the training data
Generalised Bayesian Networks
- Comley and Dowe (2003, 2004), building on ideas from Dowe and Wallace (1998), introduced Generalised Bayesian Networks
- This project extends their work
What was done in this project?
Additions to Generalised Bayesian Networks:
- Factor Analysis: the real model might depend on some underlying factors (e.g. height, weight, size); it also reduces dimensionality
- Logistic Regression: gives the dependence of a binary attribute on other attributes; CPDs can be represented as a logistic regression function, which gives compact approximations for CPDs
- Projections: help visualise the medical data (original dimensionality around 30,000)
The Tools
- Factor Analysis
- Logistic Regression
- Projections
The Minimum Message Length (MML) Principle
- Models the data as a two-part message consisting of a hypothesis H and the data D it encodes
- The best model is the one with the minimum message length
- Minimising the message length is equivalent to maximising the posterior probability Pr(H|D), since a message length is the negative log of a probability
- The message is represented as: [ Hypothesis | Data ]
- The length of the message is: -log(prior) - log(likelihood) = -log Pr(H) - log Pr(D|H)
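As a toy illustration of the two-part message length (not the full MML estimator, which also has to code continuous parameters to an optimal precision), the sketch below compares two hypothetical coin-bias hypotheses by the length in bits of a message that first states the hypothesis and then encodes the data under it; the priors, hypotheses and data are made-up assumptions.

```python
import math

def two_part_length(prior_h: float, likelihood: float) -> float:
    """Message length in bits: -log2 Pr(H) - log2 Pr(D | H)."""
    return -math.log2(prior_h) - math.log2(likelihood)

def binomial_likelihood(p: float, heads: int, n: int) -> float:
    """Pr(D | H) for D = 'heads' heads in n tosses under bias p."""
    return math.comb(n, heads) * p**heads * (1 - p)**(n - heads)

# Two candidate hypotheses with equal prior 0.5, explaining 8 heads in 10 tosses
for p in (0.5, 0.8):
    length = two_part_length(0.5, binomial_likelihood(p, 8, 10))
    print(f"bias {p}: message length {length:.2f} bits")

# The hypothesis with the shorter total message is preferred under MML.
```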
Factor Analysis
- Multiple attributes may be defined by a common factor; representing factors will result in a more compact Bayesian Network
- The Wallace and Freeman model for Single Factor Analysis was implemented
- The validity of the program built was checked using the artificial and real-world data-sets specified in the Wallace and Freeman paper

Size   | Height  | Weight
Large  | Tall    | Average
Large  | Short   | Heavy
Medium | Average |
Small  | Short   | Light
Factor Analysis
- Attributes A1 and A2 have a common factor F1; attributes A3, A4 and A5 have a common factor F2
- The equation for the model is
  x_nk = μ_k + a_k v_n + σ_k r_nk
  where x_nk is the data, μ_k the mean, a_k the attribute-related term, v_n the record-related term, σ_k the standard deviation, and r_nk are N(0,1) random variates
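As an illustration of how the terms of this model combine, the sketch below simulates data from it; the number of records, the three attributes and every parameter value are assumptions chosen for the example, not the Wallace and Freeman test data.

```python
import numpy as np

# Single factor model from the slide: x[n, k] = mu[k] + a[k] * v[n] + sigma[k] * r[n, k]
rng = np.random.default_rng(0)
n_records, n_attrs = 200, 3

mu = np.array([170.0, 70.0, 2.0])     # per-attribute means (assumed)
a = np.array([8.0, 10.0, 0.5])        # attribute-related terms (assumed)
sigma = np.array([3.0, 4.0, 0.2])     # per-attribute standard deviations (assumed)

v = rng.standard_normal(n_records)                # record-related term, one value per record
r = rng.standard_normal((n_records, n_attrs))     # N(0,1) random variates

x = mu + np.outer(v, a) + sigma * r               # (n_records, n_attrs) data matrix
print(np.corrcoef(x, rowvar=False).round(2))      # the shared factor induces correlated attributes
```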
Results - Factor Analysis
[Four figure slides showing the factor analysis results; one panel is labelled "No Factor"]
Logistic Regression
- A mathematical modelling approach used for describing the dependence of a variable on other attributes
- Used to define a discrete target attribute as a function of continuous attributes
- Gives a compact approximation for a Conditional Probability Distribution
Logistic Regression
The equation for the model is
  Pr(Y_i = 1) = e^(β_0 + β_1 X_i) / (1 + e^(β_0 + β_1 X_i))
where Y_i is the target binary attribute, β_0 and β_1 are the parameters, and X_i is the (continuous) parent attribute.
In the previous example, X_i = temperature and Pr(Y_i = 1) = probability of fire.
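A minimal sketch of evaluating this model for a single continuous parent attribute; the parameter values and the example input are assumptions, since no fitted values are given on the slide.

```python
import math

def pr_y_equals_1(x: float, beta0: float, beta1: float) -> float:
    """Pr(Y = 1) = e^(beta0 + beta1 * x) / (1 + e^(beta0 + beta1 * x))."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# e.g. probability of the binary outcome (fire) at a temperature of 35 degrees,
# with made-up parameters beta0 = -10 and beta1 = 0.3
print(pr_y_equals_1(35.0, beta0=-10.0, beta1=0.3))
```

Fitting β_0 and β_1 is usually done by maximising the likelihood of the training data; under MML the cost of stating the parameters would also be included in the message length.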
Projections
- Medical data-sets have high dimensionality (approximately 30,000)
- This makes them impossible to visualise directly
- Projecting to lower dimensions (2D) will help visualise these data-sets
Projections
- Based on ideas from Yang (2003)
- A Minimum Cost Spanning Tree (MCST) of the data-set is created
- Points are laid out in 2D* by preserving their distances exactly to the two nearest neighbours that have already been laid out
- After the graph is created, a new point can be laid out by preserving its distances to its two nearest neighbours (see the sketch below)
* Generalises to lower dimensions other than 2D
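The sketch below illustrates only the placement step, assuming the matrix of original-space pairwise distances is already computed: each new point is placed at an intersection of two circles so that its distances to its two nearest already-placed neighbours are preserved where geometrically possible. The MCST construction and the traversal order are omitted, points are processed in input order, and degenerate cases are handled naively, so this is an illustrative approximation rather than the method of Yang's paper.

```python
import numpy as np

def place_points_2d(D: np.ndarray) -> np.ndarray:
    """Lay points out in 2D so each point keeps its original distances to its
    two nearest already-placed neighbours, where the two circles intersect.
    D is the matrix of pairwise distances in the original space."""
    n = D.shape[0]
    pos = np.zeros((n, 2))
    placed = [0]                          # first point at the origin
    if n > 1:
        pos[1] = (D[0, 1], 0.0)           # second point on the x-axis at its true distance
        placed.append(1)
    for i in range(2, n):
        a, b = sorted(placed, key=lambda j: D[i, j])[:2]   # two nearest placed neighbours
        pA, pB, rA, rB = pos[a], pos[b], D[i, a], D[i, b]
        d = np.linalg.norm(pB - pA)
        if d == 0.0:
            pos[i] = pA + (rA, 0.0)       # degenerate case: coincident anchor points
        else:
            t = (rA**2 - rB**2 + d**2) / (2 * d)
            h = np.sqrt(max(rA**2 - t**2, 0.0))            # clamp when the circles do not meet
            base = pA + t * (pB - pA) / d
            perp = np.array([pA[1] - pB[1], pB[0] - pA[0]]) / d
            pos[i] = base + h * perp      # one of the two intersection points, chosen arbitrarily
        placed.append(i)
    return pos

# e.g. project a random 5-dimensional data-set down to 2D
X = np.random.default_rng(0).standard_normal((50, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(place_points_2d(D).shape)
```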
Results - Projections
[Figure: a 3D data-set shown in the XY, YZ and XZ planes, together with its 2D projection]
Results - Projections
Projection of the Central Nervous System data-set (dimensionality 74; 60 observations)
What is being done in this project?
- A single factor analysis tool was created
- A logistic regression tool is being created
- A tool for projecting to lower dimensions is being created (it currently projects to 2D, and has not yet been tested for correctness)
- Incorporating these tools into the program for creating Generalised Bayesian Networks may not be done due to constraints
References
- J. W. Comley and D. L. Dowe: General Bayesian Networks and Asymmetric Languages, Proceedings of the 2003 Hawaii International Conference on Statistics and Related Fields (HICS 2003), Honolulu, Hawaii, USA, 5-8 June 2003, ISSN 1539-7211, pp. 1-18.
- J. W. Comley and D. L. Dowe: Minimum Message Length and Generalised Bayesian Nets with Asymmetric Languages, in P. D. Grunwald, I. J. Myung and M. A. Pitt (eds), Advances in Minimum Description Length: Theory and Applications, MIT Press. To be published 2004.
- D. L. Dowe and C. S. Wallace: Kolmogorov complexity, minimum message length and inverse learning, in W. Robb (ed), Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), Queensland, Australia, 6-10 July 1998, p. 144.
- C. S. Wallace and P. R. Freeman: Single factor analysis by MML estimation, Journal of the Royal Statistical Society B, 54(1):195-209, 1992.
- Li Yang: Distance-preserving projection of high dimensional data, Pattern Recognition Letters, 25(2):259-266, 2004.
Thank You
Any questions?