Analysing Microarray Data Using Bayesian Network Learning
  Name: Phirun Son
  Supervisor: Dr. Lin Liu
Contents
  Aims
  Microarrays
  Bayesian Networks
  Classification
  Methodology
  Results
Aims and Goals
  Investigate the suitability of Bayesian networks for analysing microarray data
  Apply Bayesian network learning to microarray data for classification
  Compare with other classification techniques
Microarrays
  Array of microscopic dots representing gene expression levels
  Gene expression is the process by which DNA genes are transcribed into RNA
  Short sections of genes are attached to a surface such as glass or silicon
  Treated with dyes to obtain expression levels
Challenges of Microarray Data
  Very large number of variables, low number of samples
  Data is noisy and incomplete
  Standardisation of data format
    ◦ MGED – MIAME, MAGE-ML, MAGE-TAB
    ◦ ArrayExpress, GEO, CIBEX
Bayesian Networks
  Represent conditional independencies between random variables
  Two components:
    ◦ Directed Acyclic Graph (DAG)
    ◦ Probability table for each node, conditioned on its parents
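The two components can be illustrated with a minimal Python sketch (Python is used here only for illustration; the project itself used MATLAB). The three-node network, its node names, and all probabilities are hypothetical. The sketch shows how the DAG and the probability tables together define the joint distribution as a product of per-node conditional probabilities.

```python
from itertools import product

# Hypothetical three-node network: Smoking -> Cancer -> XRay (all binary).
# DAG component: each node lists its parents.
parents = {"Smoking": [], "Cancer": ["Smoking"], "XRay": ["Cancer"]}

# Probability-table component: P(node = True | parent values).
# All numbers are illustrative, not taken from the presentation.
cpt = {
    "Smoking": {(): 0.3},
    "Cancer": {(True,): 0.10, (False,): 0.01},
    "XRay": {(True,): 0.90, (False,): 0.05},
}


def joint_probability(assignment):
    """P(assignment) as the product of each node's conditional probability."""
    prob = 1.0
    for node, node_parents in parents.items():
        key = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][key]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


nodes = list(parents)
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in product([True, False], repeat=len(nodes))
)
print(round(total, 10))  # the eight joint probabilities sum to 1
```

Storing one table per node keeps the number of parameters small compared with a full joint table, which matters when the number of variables is large.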
Methodology
  Create a program to test accuracy of classification
    ◦ Written in MATLAB using Bayes Net Toolbox (Murphy, 2001) and Structure Learning Package (Leray, 2004)
    ◦ Uses naive network structure, K2 structure learning, and predetermined structure
  Test program on synthetic data
  Test program using real data
  Comparison of Bayes Net and Decision Tree
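The accuracy test for the naive structure can be sketched as follows. This is not the MATLAB program described above; it is a plain-Python stand-in that assumes discrete features, Laplace smoothing, and a toy data set invented for the example. The function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class and per-feature value frequencies for a naive structure."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values_seen = defaultdict(set)         # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen


def predict(model, row):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = feature_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(rows, labels, folds=10, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for k in range(folds):
        test_idx = set(indices[k::folds])
        train = [i for i in indices if i not in test_idx]
        model = train_naive_bayes([rows[i] for i in train],
                                  [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


# Toy data (invented): feature 0 agrees with the class 90% of the time,
# feature 1 is pure noise.
rng = random.Random(1)
labels = [rng.random() < 0.5 for _ in range(100)]
rows = [(int(y) if rng.random() < 0.9 else 1 - int(y), rng.randint(0, 2))
        for y in labels]
accuracy = cross_validate(rows, labels, folds=10)
print(f"10-fold accuracy: {accuracy:.1%}")
```

Repeating `cross_validate` with different seeds and averaging would correspond to the "Iterations" reported with the results.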
Synthetic Data
  Data created from well-known Bayesian Network examples
    ◦ Asia network, car network, and ALARM network
  Samples generated from each network
  Tested with naive structure, pre-known structure, and structure learning
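Generating samples from a known network can be done by forward (ancestral) sampling: visit nodes parents-first and draw each value from its conditional distribution. The sketch below assumes a hypothetical three-node binary chain with illustrative probabilities; it is not the Asia, car, or ALARM network.

```python
import random

# Hypothetical binary network, listed in topological order (parents first).
# Each entry: node -> (parents, {parent values: P(node = 1)}).
# The structure and numbers are illustrative only.
network = {
    "Smoking": ((), {(): 0.5}),
    "Bronchitis": (("Smoking",), {(1,): 0.6, (0,): 0.3}),
    "Dyspnoea": (("Bronchitis",), {(1,): 0.8, (0,): 0.1}),
}


def draw_sample(rng):
    """Draw one joint sample by sampling each node given its parents."""
    sample = {}
    for node, (node_parents, table) in network.items():
        p_one = table[tuple(sample[p] for p in node_parents)]
        sample[node] = 1 if rng.random() < p_one else 0
    return sample


def generate(n, seed=0):
    """Generate n independent samples with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [draw_sample(rng) for _ in range(n)]


samples = generate(50)
print(samples[0])
print(sum(s["Smoking"] for s in samples) / len(samples))  # empirical P(Smoking = 1)
```

With only 50 samples the empirical frequencies can differ noticeably from the table values, which is one reason small sample sizes make structure learning harder.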
Synthetic Data - Results
  Asia Network
  Lauritzen and Spiegelhalter, ‘Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems’, 1988, pg 164
  50 Samples, 10 Folds, 100 Iterations; Class Node: Dyspnoea
                   Correct
    Naive          81.0%
    K2 Learning    83.4%
    Known Graph    85.0%
  100 Samples, 10 Folds, 50 Iterations; Class Node: Dyspnoea
                   Correct
    Naive          83.1%
    K2 Learning    84.3%
    Known Graph    85.1%
Synthetic Data - Results
  Car Network
  Heckerman et al., ‘Troubleshooting under Uncertainty’, 1994, pg
  Samples, 10 Folds, 100 Iterations; Class Node: Engine Starts
                   Correct
    Naive          53.5%
    K2 Learning    58.3%
    Known Graph    62.4%
  100 Samples, 10 Folds, 50 Iterations; Class Node: Engine Starts
                   Correct
    Naive          56.5%
    K2 Learning    58.7%
    Known Graph    61.2%
Synthetic Data - Results
  ALARM Network: 37 Nodes, 46 Connections
  Beinlich et al., ‘The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks’, 1989
  50 Samples, 10 Folds, 10 Iterations; Class Node: InsufAnesth
                   Correct
    Naive          72.4%
    K2 Learning    78.7%
    Known Graph    89.6%
  50 Samples, 10 Folds, 10 Iterations; Class Node: Hypovolemia
                   Correct
    Naive          69.0%
    K2 Learning    77.8%
    Known Graph    93.6%
Lung Cancer Data Set
  Publicly available data sets:
    ◦ Harvard: Bhattacharjee et al., ‘Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses’, 2001
        11,657 attributes, 156 instances, Affymetrix
    ◦ Michigan: Beer et al., ‘Gene-Expression Profiles Predict Survival of Patients with Lung Adenocarcinoma’, 2002
        6,357 attributes, 96 instances, Affymetrix
    ◦ Stanford: Garber et al., ‘Diversity of Gene Expression in Adenocarcinoma of the Lung’, 2001
        11,985 attributes, 46 instances, cDNA
  Contains missing values
Feature Selection
  Li (2009) provides a feature-selected set of 90 attributes
    ◦ Using WEKA feature selection
    ◦ Also allows comparison with Decision Tree based classification
  Discretised data in 3 forms
    ◦ Undetermined values left unknown
    ◦ Undetermined values put into either category – two category
    ◦ Undetermined values put into another category – three category
  WEKA: Ian H. Witten and Eibe Frank, ‘Data Mining: Practical machine learning tools and techniques’, 2005.
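The three discretisation forms can be sketched as below. The slides do not say how a value is judged "undetermined" or how it is assigned in the two-category form, so this sketch makes loud assumptions: a value is undetermined when it falls between two hypothetical cut-offs (`LOW`, `HIGH`), and the two-category form assigns it to the nearer category.

```python
from typing import Optional

# Hypothetical cut-offs; the presentation does not give the real thresholds.
LOW, HIGH = -0.5, 0.5


def discretise(value: float, mode: str) -> Optional[int]:
    """Map an expression value to a category.

    mode "unknown": undetermined values stay unknown (None)
    mode "two":     undetermined values go to the nearer of the two categories
                    (assumed rule, not confirmed by the slides)
    mode "three":   undetermined values form a third category (2)
    """
    if value <= LOW:
        return 0
    if value >= HIGH:
        return 1
    if mode == "unknown":
        return None
    if mode == "two":
        return 0 if value < (LOW + HIGH) / 2 else 1
    if mode == "three":
        return 2
    raise ValueError(f"unknown mode: {mode}")


values = [-1.2, -0.1, 0.2, 0.9]
for mode in ("unknown", "two", "three"):
    print(mode, [discretise(v, mode) for v in values])
```

Leaving values as `None` requires a classifier that tolerates missing data, whereas the two- and three-category forms give every attribute a definite value.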
Harvard Set
  Trained on Harvard, tested on Michigan
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    95 (99.0%)    (remaining values missing; column assignment uncertain)
    2-Cat -> 2-Cat F     94 (97.9%)    93 (96.9%)    92 (95.8%)
    3-Cat -> 3-Cat NF    94 (97.9%)    95 (99.0%)    94 (97.9%)
    3-Cat -> 3-Cat F     88 (91.7%)    95 (99.0%)    94 (97.9%)
  Trained on Harvard, tested on Stanford
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     43 (93.5%)
    2-Cat -> 2-Cat F     41 (89.1%)    45 (97.8%)    36 (78.3%)
    3-Cat -> 3-Cat NF    41 (89.1%)    46 (100%)     42 (91.3%)
    3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     42 (91.3%)
Michigan Set
  Trained on Michigan, tested on Harvard
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    150 (96.2%)   154 (98.7%)   153 (98.1%)
    2-Cat -> 2-Cat F     144 (92.3%)   153 (98.1%)   150 (96.2%)
    3-Cat -> 3-Cat NF    145 (92.9%)   153 (98.1%)   (third value missing; column assignment uncertain)
    3-Cat -> 3-Cat F     140 (89.7%)   152 (97.4%)   153 (98.1%)
  Trained on Michigan, tested on Stanford
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     41 (89.1%)
    2-Cat -> 2-Cat F     41 (89.1%)    46 (100%)     40 (87.0%)
    3-Cat -> 3-Cat NF    41 (89.1%)    45 (97.8%)    39 (84.8%)
    3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     39 (84.8%)
Stanford Set
  Trained on Stanford, tested on Harvard
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    139 (89.1%)   153 (98.1%)   139 (89.1%)
    2-Cat -> 2-Cat F     139 (89.1%)   150 (96.2%)   124 (79.5%)
    3-Cat -> 3-Cat NF    139 (89.1%)   150 (96.2%)   154 (98.7%)
    3-Cat -> 3-Cat F     139 (89.1%)   150 (96.2%)   152 (97.4%)
  Trained on Stanford, tested on Michigan
                         MATLAB        WEKA          DT
    2-Cat -> 2-Cat NF    86 (89.6%)    95 (99.0%)    86 (89.6%)
    2-Cat -> 2-Cat F     86 (89.6%)    92 (95.8%)    72 (75.0%)
    3-Cat -> 3-Cat NF    86 (89.6%)    95 (99.0%)    94 (97.9%)
    3-Cat -> 3-Cat F     86 (89.6%)    95 (99.0%)    91 (94.8%)
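Each cell in the cross-data-set results is a count of correctly classified instances followed by its share of the test set. A short check, assuming the instance counts listed for the three data sets (156, 96, 46), reproduces the percentages; the helper name `accuracy` is hypothetical.

```python
# Instance counts as listed for the lung cancer data sets.
instances = {"Harvard": 156, "Michigan": 96, "Stanford": 46}


def accuracy(correct: int, test_set: str) -> str:
    """Format a correct-count as 'count (percent%)' for the given test set."""
    return f"{correct} ({100 * correct / instances[test_set]:.1f}%)"


print(accuracy(95, "Michigan"))   # 95 (99.0%)
print(accuracy(41, "Stanford"))   # 41 (89.1%)
print(accuracy(150, "Harvard"))   # 150 (96.2%)
```

The denominators confirm which data set each table was tested on: counts out of 96 are Michigan, out of 46 are Stanford, and out of 156 are Harvard.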
Future Work
  Use structure learning for Bayesian classifiers
  Increase the amount of homogeneous data
  Other methods of classification