1
Analysing Microarray Data Using Bayesian Network Learning
Name: Phirun Son
Supervisor: Dr. Lin Liu
2
Contents
◦ Aims
◦ Microarrays
◦ Bayesian Networks
◦ Classification Methodology
◦ Results
3
Aims and Goals
◦ Investigate the suitability of Bayesian networks for the analysis of microarray data
◦ Apply Bayesian network learning to microarray data for classification
◦ Compare with other classification techniques
4
Microarrays
◦ An array of microscopic spots representing gene expression levels
◦ Gene expression is the process by which DNA genes are transcribed into RNA
◦ Short sections of genes are attached to a surface such as glass or silicon
◦ Samples are treated with fluorescent dyes to obtain expression levels
5
Challenges of Microarray Data
◦ Very large number of variables, low number of samples
◦ Data is noisy and incomplete
◦ Standardisation of data formats:
  ◦ MGED standards: MIAME, MAGE-ML, MAGE-TAB
  ◦ Public repositories: ArrayExpress, GEO, CIBEX
6
Bayesian Networks
◦ Represent the conditional independencies among a set of random variables
◦ Two components:
  ◦ Directed Acyclic Graph (DAG)
  ◦ Conditional probability tables (one per node)
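To make the two components concrete, here is a minimal, self-contained MATLAB sketch (illustrative only, not the project code) of a three-node network S -> C -> X. The node names and all probability values are made up; the point is how the DAG factorises the joint distribution into one table per node.

% Minimal illustration of a Bayesian network: a DAG plus one
% (conditional) probability table per node. Three binary nodes:
% S = Smoking, C = Cancer, X = abnormal X-ray, with edges S -> C -> X.
% All numbers are invented for illustration.

% DAG as an adjacency matrix: dag(i,j) = 1 means an edge i -> j.
S = 1; C = 2; X = 3;
dag = zeros(3);
dag(S, C) = 1;
dag(C, X) = 1;

% Probability tables (state 1 = false, state 2 = true).
pS   = [0.70 0.30];        % P(S)
pC_S = [0.99 0.01;         % P(C | S = false)
        0.90 0.10];        % P(C | S = true)
pX_C = [0.95 0.05;         % P(X | C = false)
        0.02 0.98];        % P(X | C = true)

% The DAG factorises the joint distribution:
% P(S, C, X) = P(S) * P(C | S) * P(X | C).
joint = @(s, c, x) pS(s) * pC_S(s, c) * pX_C(c, x);

fprintf('P(S=true, C=true, X=true) = %.6f\n', joint(2, 2, 2));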
7
Methodology
◦ Create a program to test classification accuracy
  ◦ Written in MATLAB using the Bayes Net Toolbox (Murphy, 2001) and the Structure Learning Package (Leray, 2004)
  ◦ Uses a naive network structure, K2 structure learning, and a pre-determined structure
◦ Test the program on synthetic data
◦ Test the program on real data
◦ Compare Bayesian network and Decision Tree classifiers
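As a rough sketch of this pipeline (not the actual project code), the snippet below strings together documented Bayes Net Toolbox calls: K2 structure learning, maximum-likelihood parameter fitting, and classification by inferring the class node. The data layout (an nnodes-by-ncases matrix of 1-based discrete values with the class in the last row), the node ordering, and the variables data, ns, and test_case are all assumptions.

% Sketch of the classification pipeline using the Bayes Net Toolbox (BNT).
% Assumes 'data' is an nnodes-by-ncases matrix of discrete values (1-based),
% 'ns' holds the number of states per node, and the last node is the class.
nnodes = size(data, 1);
class_node = nnodes;

% Structure learning with K2 (requires a node ordering; assumed here).
order = 1:nnodes;
dag = learn_struct_K2(data, ns, order);

% Build the network and fit tabular CPDs by maximum likelihood.
bnet = mk_bnet(dag, ns);
for i = 1:nnodes
    bnet.CPD{i} = tabular_CPD(bnet, i);
end
bnet = learn_params(bnet, data);

% Classify one test case: enter all features as evidence and
% read off the posterior over the class node.
engine = jtree_inf_engine(bnet);
evidence = num2cell(test_case);        % test_case: nnodes-by-1 column
evidence{class_node} = [];             % class node is left hidden
engine = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, class_node);
[~, predicted_class] = max(marg.T);    % most probable class label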
8
Synthetic Data
◦ Data created from well-known Bayesian network examples
  ◦ Asia network, car network, and ALARM network
◦ Samples generated from each network
◦ Tested with a naive structure, the known structure, and a learned structure
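A hedged sketch of the sample-generation step, using BNT's forward-sampling routine sample_bnet. The toy two-node network and its CPT values are stand-ins (the slides do not spell out how the Asia, car, or ALARM networks were encoded).

% Generate a synthetic data set by forward sampling from a known network.
% Toy network: A -> B, both binary; CPT values invented for illustration.
dag = zeros(2);  dag(1, 2) = 1;
ns = [2 2];                            % number of states per node
bnet = mk_bnet(dag, ns);
bnet.CPD{1} = tabular_CPD(bnet, 1, 'CPT', [0.6 0.4]);
bnet.CPD{2} = tabular_CPD(bnet, 2, 'CPT', [0.8 0.3 0.2 0.7]);

ncases = 100;
data = zeros(2, ncases);
for l = 1:ncases
    data(:, l) = cell2mat(sample_bnet(bnet)); % one forward sample per column
end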
9
Synthetic Data - Results
Asia Network (Lauritzen and Spiegelhalter, 'Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems', 1988, p. 164)

50 Samples, 10 Folds, 100 Iterations (Class Node: Dyspnoea)
◦ Naive: 81.0% correct
◦ K2 Learning: 83.4% correct
◦ Known Graph: 85.0% correct

100 Samples, 10 Folds, 50 Iterations (Class Node: Dyspnoea)
◦ Naive: 83.1% correct
◦ K2 Learning: 84.3% correct
◦ Known Graph: 85.1% correct
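The "Folds" and "Iterations" settings suggest repeated k-fold cross-validation. A plain-MATLAB sketch of that protocol follows; train_and_classify is a hypothetical placeholder for any of the three classifiers (naive, K2-learned, known graph), and the class-in-last-row data layout is an assumption.

% Repeated k-fold cross-validation over an nnodes-by-ncases data matrix.
% train_and_classify is a placeholder: it trains on the given cases
% (including the class row) and returns a row vector of predicted class
% labels for the test cases, whose class row is withheld.
k = 10;  niter = 100;
ncases = size(data, 2);
acc = zeros(niter, 1);
for it = 1:niter
    perm = randperm(ncases);            % random fold assignment per iteration
    fold = mod(0:ncases-1, k) + 1;      % fold label for each permuted case
    correct = 0;
    for f = 1:k
        test  = perm(fold == f);
        train = perm(fold ~= f);
        pred = train_and_classify(data(:, train), data(1:end-1, test));
        correct = correct + sum(pred == data(end, test));
    end
    acc(it) = correct / ncases;
end
fprintf('mean accuracy over %d iterations: %.1f%%\n', niter, 100 * mean(acc));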
10
Synthetic Data - Results
Car Network (Heckerman et al., 'Troubleshooting under Uncertainty', 1994, p. 13)

50 Samples, 10 Folds, 100 Iterations (Class Node: Engine Starts)
◦ Naive: 53.5% correct
◦ K2 Learning: 58.3% correct
◦ Known Graph: 62.4% correct

100 Samples, 10 Folds, 50 Iterations (Class Node: Engine Starts)
◦ Naive: 56.5% correct
◦ K2 Learning: 58.7% correct
◦ Known Graph: 61.2% correct
11
Synthetic Data - Results
ALARM Network, 37 Nodes, 46 Connections (Beinlich et al., 'The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks', 1989)

50 Samples, 10 Folds, 10 Iterations (Class Node: InsufAnesth)
◦ Naive: 72.4% correct
◦ K2 Learning: 78.7% correct
◦ Known Graph: 89.6% correct

50 Samples, 10 Folds, 10 Iterations (Class Node: Hypovolemia)
◦ Naive: 69.0% correct
◦ K2 Learning: 77.8% correct
◦ Known Graph: 93.6% correct
12
Lung Cancer Data Sets
Publicly available data sets:
◦ Harvard: Bhattacharjee et al., 'Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses', 2001 (11,657 attributes, 156 instances, Affymetrix)
◦ Michigan: Beer et al., 'Gene-Expression Profiles Predict Survival of Patients with Lung Adenocarcinoma', 2002 (6,357 attributes, 96 instances, Affymetrix)
◦ Stanford: Garber et al., 'Diversity of Gene Expression in Adenocarcinoma of the Lung', 2001 (11,985 attributes, 46 instances, cDNA; contains missing values)
13
Feature Selection
◦ Li (2009) provides a feature-selected set of 90 attributes
  ◦ Selected using WEKA feature selection
  ◦ Also allows comparison with Decision Tree based classification
◦ Data discretised in three forms (see the sketch after this slide):
  ◦ Undetermined values left unknown
  ◦ Undetermined values put into either of the two categories (two-category)
  ◦ Undetermined values put into a third category (three-category)
WEKA: Ian H. Witten and Eibe Frank, 'Data Mining: Practical machine learning tools and techniques', 2005.
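The three discretisation forms might look like the following plain-MATLAB sketch. The cut-offs lo and hi, the sample values in x, the nearest-threshold rule in form 2, and the use of NaN for unknowns are illustrative assumptions, not the project's actual choices.

% Discretise expression values into categories, handling "undetermined"
% values (those falling between the two thresholds) in three ways.
x = [-1.2 -0.3 0.1 0.7 0.4 -0.6];     % illustrative expression values
lo = -0.5;  hi = 0.5;                 % assumed cut-offs
low   = x <= lo;                      % clearly low  -> category 1
high  = x >= hi;                      % clearly high -> category 2
undet = ~low & ~high;                 % undetermined values

% Form 1: undetermined values left unknown (NaN marks missing).
d1 = nan(size(x));  d1(low) = 1;  d1(high) = 2;

% Form 2: undetermined values forced into one of the two categories
% (here, whichever threshold is nearer).
d2 = d1;
d2(undet & (abs(x - lo) <= abs(x - hi))) = 1;
d2(undet & (abs(x - lo) >  abs(x - hi))) = 2;

% Form 3: undetermined values given their own third category.
d3 = d1;  d3(undet) = 3;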
14
Harvard Set

Harvard Training on Michigan (96 instances)
                     MATLAB        WEKA          DT
2-Cat -> 2-Cat NF    95 (99.0%)
2-Cat -> 2-Cat F     94 (97.9%)    93 (96.9%)    92 (95.8%)
3-Cat -> 3-Cat NF    94 (97.9%)    95 (99.0%)    94 (97.9%)
3-Cat -> 3-Cat F     88 (91.7%)    95 (99.0%)    94 (97.9%)

Harvard Training on Stanford (46 instances)
                     MATLAB        WEKA          DT
2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     43 (93.5%)
2-Cat -> 2-Cat F     41 (89.1%)    45 (97.8%)    36 (78.3%)
3-Cat -> 3-Cat NF    41 (89.1%)    46 (100%)     42 (91.3%)
3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     42 (91.3%)
15
Michigan Set

Michigan Training on Harvard (156 instances)
                     MATLAB         WEKA           DT
2-Cat -> 2-Cat NF    150 (96.2%)    154 (98.7%)    153 (98.1%)
2-Cat -> 2-Cat F     144 (92.3%)    153 (98.1%)    150 (96.2%)
3-Cat -> 3-Cat NF    145 (92.9%)    153 (98.1%)
3-Cat -> 3-Cat F     140 (89.7%)    152 (97.4%)    153 (98.1%)

Michigan Training on Stanford (46 instances)
                     MATLAB        WEKA          DT
2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     41 (89.1%)
2-Cat -> 2-Cat F     41 (89.1%)    46 (100%)     40 (87.0%)
3-Cat -> 3-Cat NF    41 (89.1%)    45 (97.8%)    39 (84.8%)
3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     39 (84.8%)
16
Stanford Set

Stanford Training on Harvard (156 instances)
                     MATLAB         WEKA           DT
2-Cat -> 2-Cat NF    139 (89.1%)    153 (98.1%)    139 (89.1%)
2-Cat -> 2-Cat F     139 (89.1%)    150 (96.2%)    124 (79.5%)
3-Cat -> 3-Cat NF    139 (89.1%)    150 (96.2%)    154 (98.7%)
3-Cat -> 3-Cat F     139 (89.1%)    150 (96.2%)    152 (97.4%)

Stanford Training on Michigan (96 instances)
                     MATLAB        WEKA          DT
2-Cat -> 2-Cat NF    86 (89.6%)    95 (99.0%)    86 (89.6%)
2-Cat -> 2-Cat F     86 (89.6%)    92 (95.8%)    72 (75.0%)
3-Cat -> 3-Cat NF    86 (89.6%)    95 (99.0%)    94 (97.9%)
3-Cat -> 3-Cat F     86 (89.6%)    95 (99.0%)    91 (94.8%)
17
Future Work
◦ Use structure learning for Bayesian classifiers
◦ Obtain larger, more homogeneous data sets
◦ Investigate other methods of classification