1 Decision tree based classifications of heterogeneous lung cancer data Student: Yi LI Supervisor: Associate Prof. Jiuyong Li Date: 15th May 2009.

Presentation transcript:

1 Decision tree based classifications of heterogeneous lung cancer data Student: Yi LI Supervisor: Associate Prof. Jiuyong Li Date: 15th May 2009

2 Outline: Microarray data; Motivations; Related work; Our integrated framework; Experiments; Discussions; Conclusion

3 Microarray data Microarray rationale (Babu, 2004b)

4 Microarray data (cont’d) Snapshot of DNA oligonucleotides. Reveals rich biological information: DNA sequences, cell structures & cancer. Huge amount of data: ◦ Number of attributes: in the thousands or more ◦ Number of samples: in the hundreds or fewer

5 Microarray data (cont’d) [Figure: a part of a microarray data set, labelled with gene names, patient samples and their values]

6 Motivations Key goal: to find reliable and robust predictors (gene sets). However, microarray studies addressing similar prediction tasks report different sets of predictive genes.

7 Motivations (cont’d) Two-dimensional cluster analysis + leave-one-out cross-validation [van’t Veer et al. (2002)]; Cox’s proportional-hazards regression + clustering [Wang et al. (2005)]

8 Research question How can we build a framework that improves prediction accuracy across heterogeneous microarray data sets?

9 Dilemma 1 A microarray data set usually contains thousands of features but only a limited number of samples, which makes it difficult to obtain robust and reliable classifiers.

10 Related work Curse of data set sparsity + curse of dimensionality [Somorjai et al. (2003)] ◦ Use simple classifiers to show how those curses influence outcomes ◦ The samples-per-feature ratio (SFR) in a microarray data set is too small to expect robust classifiers ◦ Conventional solutions: reduce the features, or apply classifiers that do not require feature space reduction
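
To make the samples-per-feature ratio concrete, here is a minimal sketch in Python. The function name and the data set sizes are hypothetical and chosen only for illustration; they are not the sizes of the Harvard, Michigan or Stanford sets.

```python
def samples_per_feature_ratio(n_samples: int, n_features: int) -> float:
    """Return the samples-per-feature ratio (SFR) of a data set."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_samples / n_features


# Hypothetical sizes for illustration only: 100 samples, 10,000 genes.
sfr = samples_per_feature_ratio(100, 10_000)
print(f"SFR = {sfr:.3f}")  # a value far below 1 means far fewer samples than features
```

Reducing the gene list raises the SFR without adding any samples, which is the motivation for the feature selection step later in the deck.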

11 Related work Probably approximately correct (PAC) sorting [Ein-Dor et al. (2006)] ◦ Use PAC sorting to evaluate the robustness of results ◦ Determine the number of samples required to achieve any desired level of reproducibility

12 Dilemma 2 Heterogeneous microarray platforms, differences in equipment and protocols, and differences in analysis methods may also cause discordance between independent experiments.

13 Related work Correlation and concordance calculations [Kuo et al. (2002)]; Median rank scores + quantile discretization + SVM [Warnat et al. (2005)] ◦ Stanford-type cDNA microarrays and Affymetrix oligonucleotide microarrays

14 Dilemma 3 Even after eliminating the factors mentioned in dilemmas 1 and 2, discrepancies between studies remain.

15 Related work Expand the standard strategy to multiple sets [Michiels et al. (2005)]; SVM-RFE + 5-fold cross-validation + joint-core [Fishel et al. (2007)] ◦ There are many optimal predictive gene sets, and they depend strongly on the subset of samples chosen for training.

16 Research goal Our purpose is to build a robust and reliable model for studying heterogeneous microarray data sets, to reduce study-specific biases, and to yield results with improved reliability and validity.

17 Our integrated framework 1. Classification on a single data set ◦ Standard classification ◦ Single tree, Bagging & Random Forest 2. Classification on integrated data sets ◦ Low-level data integration ◦ Single tree, Bagging & Random Forest 3. Classification by integrating models from multiple data sets ◦ High-level model integration ◦ Integrated model based on two single trees

18 Our integrated framework (cont’d)
Classifiers: Single tree, Bagging, Random Forest
Training sets: Harvard, Michigan, Harvard + Michigan
Test set: Stanford
Integrated model: single tree built upon Harvard + single tree built upon Michigan

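19 Available data sets
Name: Harvard, Michigan, Stanford
# of attributes: [values not preserved in the transcript]
# of samples: [values not preserved in the transcript]
Data type: Continuous
Missing values?: No, Yes [column assignment not preserved]
Class (ADEN/normal): 139/17 (Harvard), 86/10 (Michigan), 41/5 (Stanford)
Gene type: Affymetrix ID, Unknown [column assignment not preserved]
* All data sets are in .CSV format
* Attribute names are denoted by gene probe names
* All data sets are independent of each other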

20 Available data sets (cont’d) Harvard_Unique_probname.csv, Michigan_Unique_probname.csv ◦ Two columns: Probe & Gene Symbol ◦ Mapping files: map each probe name to its corresponding gene symbol ◦ Multiple probe names may map to one gene symbol

21 Data pre-processing Gene name substitution ◦ R programming language ◦ Remove missing values ◦ Remove duplicated genes: remove all copies, including the first occurrence ◦ Find overlapping genes: the common gene subset between Harvard and Michigan
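
The original work did this step in R; the following Python sketch only illustrates the logic of the slide on toy data. The function name, probe names and gene symbols are hypothetical, and dropping unmapped probes is an assumption. Missing-value removal is not shown.

```python
from collections import Counter


def probes_to_unique_genes(expression, probe_to_gene):
    """Rename probe-keyed rows to gene symbols.

    Every gene symbol reached by more than one probe is dropped entirely
    (all copies, including the first). Probes without a mapping are
    dropped too, which is an assumption of this sketch.
    """
    mapped = [probe_to_gene[p] for p in expression if p in probe_to_gene]
    counts = Counter(mapped)
    return {
        probe_to_gene[p]: values
        for p, values in expression.items()
        if p in probe_to_gene and counts[probe_to_gene[p]] == 1
    }


# Toy data: probe name -> expression values across samples (hypothetical).
harvard = {"h1": [1.0, 2.0], "h2": [3.0, 4.0], "h3": [5.0, 6.0], "h4": [0.5, 0.1]}
harvard_map = {"h1": "GENE_A", "h2": "GENE_B", "h3": "GENE_B", "h4": "GENE_C"}
michigan = {"m1": [7.0], "m2": [8.0], "m3": [9.0]}
michigan_map = {"m1": "GENE_A", "m2": "GENE_C", "m3": "GENE_D"}

h_genes = probes_to_unique_genes(harvard, harvard_map)    # GENE_B is removed
m_genes = probes_to_unique_genes(michigan, michigan_map)
common = sorted(h_genes.keys() & m_genes.keys())          # overlapping genes
print(common)
```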

22 Data pre-processing (cont’d) ◦ Substitute gene symbols with probe names: H and M contain the same set of genes (though not in the same order); Stanford contains the same set, too

23 Data pre-processing (cont’d) Feature selection ◦ Weka ◦ GainRatioAttributeEval > Ranker ◦ Select the 100 highest-ranked genes from H and M, separately ◦ 48 of them overlap; the remaining 52 genes in each list are unique
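
The ranking itself was done with Weka's GainRatioAttributeEval and Ranker. As a rough illustration of the criterion, the sketch below computes the gain ratio (information gain divided by split information) for nominal attributes and keeps the top-ranked genes. It is not Weka's implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of the attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def rank_genes(table, labels, top_k):
    """Return the top_k gene names by gain ratio (ties broken by name)."""
    scores = {gene: gain_ratio(vals, labels) for gene, vals in table.items()}
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


# Toy, already-discretized data (hypothetical genes and samples).
labels = ["ADEN", "ADEN", "normal", "normal"]
table = {
    "GENE_A": ["high", "high", "low", "low"],   # separates the classes perfectly
    "GENE_B": ["high", "low", "high", "low"],   # carries no class information
    "GENE_C": ["high", "high", "high", "low"],  # partially informative
}
top_genes = rank_genes(table, labels, top_k=2)
print(top_genes)
```

Running the same ranking on H and M separately and intersecting the two top-100 lists would give the overlapping and unique gene counts reported on the slide.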

24 Data pre-processing (cont’d) ◦ 3 parts: unique genes of H’, overlapping genes and unique genes of M’ ◦ H’, M’ and S’ all use the union of the three parts above as their gene set: H’: ‘?’s mark the unique genes of M’; M’: ‘?’s mark the unique genes of H’; S’: no missing values are generated at this stage
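
A minimal Python sketch of this alignment step, assuming each data set is held as a mapping from gene name to its per-sample values. The helper name and the toy genes are hypothetical.

```python
def align_to_union(dataset, union_genes, n_samples, missing="?"):
    """Re-express a data set over the union gene list.

    Genes absent from the data set are filled with the missing marker
    for every sample.
    """
    return {g: list(dataset.get(g, [missing] * n_samples)) for g in union_genes}


# Hypothetical top-ranked gene lists of H and M.
h_top = ["GENE_A", "GENE_B"]
m_top = ["GENE_B", "GENE_C"]

h_only = [g for g in h_top if g not in m_top]
overlap = [g for g in h_top if g in m_top]
m_only = [g for g in m_top if g not in h_top]
union_genes = h_only + overlap + m_only  # the three parts, in slide order

h_data = {"GENE_A": [1.0, 2.0], "GENE_B": [3.0, 4.0]}
m_data = {"GENE_B": [5.0], "GENE_C": [6.0]}

h_prime = align_to_union(h_data, union_genes, n_samples=2)
m_prime = align_to_union(m_data, union_genes, n_samples=1)
print(h_prime)
print(m_prime)
```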

25 Data pre-processing (cont’d) Discretization ◦ Mean value ◦ R programming language ◦ Missing values
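
The slide gives only keywords, so the following Python sketch is one plausible reading rather than the author's procedure: each gene is split at the mean of its observed values, and missing entries are left as ‘?’. The labels "high" and "low", and the choice to put values equal to the mean in "low", are assumptions.

```python
def discretize_by_mean(values, missing="?"):
    """Binarize one gene's values around the mean of its observed values.

    Missing entries are passed through unchanged. Values above the mean
    become "high"; all others become "low" (an assumption of this sketch).
    """
    observed = [v for v in values if v != missing]
    if not observed:
        return list(values)
    mean = sum(observed) / len(observed)
    return [v if v == missing else ("high" if v > mean else "low") for v in values]


print(discretize_by_mean([1.0, 3.0, "?", 5.0]))
```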

26 Data pre-processing (cont’d) Handle incompatible formats ◦ ARFF format ◦ Attribute section: same order of attributes; same possible values, listed in the same order ◦ Data section: values must match their corresponding data types
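
As an illustration of these constraints, the Python sketch below writes nominal data to ARFF text from a single shared attribute list, so that training and test files declare the same attributes and values in the same order. The function name, relation name and toy rows are hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render nominal data as ARFF text.

    attributes: list of (name, allowed_values) in a fixed order.
    rows: lists of values in the same order; "?" marks a missing value.
    """
    lines = [f"@relation {relation}", ""]
    for name, allowed in attributes:
        lines.append(f"@attribute {name} {{{','.join(allowed)}}}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match the attribute section")
        for value, (name, allowed) in zip(row, attributes):
            if value != "?" and value not in allowed:
                raise ValueError(f"{value!r} is not a declared value of {name}")
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# One shared attribute list, reused for every file that Weka must compare.
attributes = [("GENE_A", ["low", "high"]), ("class", ["ADEN", "normal"])]
arff_text = to_arff("lung", attributes, [["high", "ADEN"], ["?", "normal"]])
print(arff_text)
```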

27 Experiment 1 Weka Explorer Build single decision trees on the data sets ◦ Classify > Classifier > trees > J48 ◦ Test options > Supplied test set Build Bagging trees on the data sets ◦ Classify > Classifier > meta > Bagging Build Random Forest on the data sets ◦ Classify > Classifier > meta > RandomCommittee (Classifier > RandomForest)

28 Experiment 2 Matlab Build single trees upon H and M, separately. For an unseen instance, predict with both models: ◦ if the predicted classes are the same, keep that class; ◦ otherwise, the class label with the greater confidence value wins. ◦ Accuracy = no. of correctly predicted instances / total no. of instances
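
The original experiment was run in Matlab; the rule on this slide can be sketched in Python as follows. Each model's output is assumed to be a (class label, confidence) pair, the function names are hypothetical, and breaking an exact confidence tie in favour of the H tree is an assumption the slide does not cover.

```python
def integrate_predictions(pred_h, pred_m):
    """Combine the predictions of the H tree and the M tree for one instance.

    Each prediction is a (class_label, confidence) pair.
    """
    label_h, conf_h = pred_h
    label_m, conf_m = pred_m
    if label_h == label_m:
        return label_h
    # Disagreement: the more confident model wins (a tie goes to H, by assumption).
    return label_h if conf_h >= conf_m else label_m


def accuracy(predicted, actual):
    """Number of correctly predicted instances divided by the total."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy predictions for three unseen instances (hypothetical values).
tree_h = [("ADEN", 0.9), ("normal", 0.7), ("ADEN", 0.55)]
tree_m = [("ADEN", 0.8), ("ADEN", 0.95), ("normal", 0.6)]
truth = ["ADEN", "ADEN", "ADEN"]

combined = [integrate_predictions(h, m) for h, m in zip(tree_h, tree_m)]
print(combined, accuracy(combined, truth))
```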

29 Experiment results

30 Experiments (cont’d)

31 Experiments (cont’d)

32 Major references
Babu, M. M. (2004b) “An introduction to Microarray data analysis”. MRC Lab page, visited on 15 June 2008.
Choi, J.K. et al. (2003) Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19, i84-i90.
Ein-Dor, L. et al. (2005) Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21.
Ein-Dor, L. et al. (2006) Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. PNAS, 103.
Fishel, I. et al. (2007) Meta-analysis of gene expression data: a predictor-based approach. Bioinformatics, 23.
Jiang, H. et al. (2004) Joint analysis of two microarray gene expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5, 81.
Kuo, W.P. et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics, 18.
Michiels, S. et al. (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365.
Rhodes, D.R. et al. (2002) Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res., 62.
Van’t Veer, L.J. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415.
Wang, Y. et al. (2005) Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365.