1 Robust diagnosis DLBCL from gene expression data from different laboratories Dimacs Workshop, June 22, 2005 Gyan Bhanot, IBM Research.

1 1 Robust diagnosis DLBCL from gene expression data from different laboratories. DIMACS Workshop, June 22, 2005. Gyan Bhanot, IBM Research

2 2 Collaborators: Gabriela Alexe (1), Arnold Levine (2,3), Gustavo Stolovitzky (1). Affiliations: (1) IBM, (2) IAS, (3) UMDNJ

3 3 Overview: Motivation; Pattern-based meta-classifiers; Case study: compare data from two labs for DLBCL vs FL diagnosis

4 4 Motivation: Cancer is a genetic/proteomic disease. Genetic mutations, viruses, and radiation modify pathways to create a survival advantage for the damaged cell. Gene arrays are a way to study the variation of mRNA levels between diseased and healthy cells. This allows diagnosis and inference of the pathways that cause disease.

5 5 Cancer diagnosis. Input: training data (biomedical / proteomic, microarray) with k ≥ 2 classes (m samples) described by N >> m features. Output: a collection of robust biomarkers and models; a robust, accurate classifier tested on out-of-sample data.

6 6

7 7 Strategy of present paper: 1. Transform original data to “pattern” space. 2. Find robust sets of biomarkers with significant collective discriminatory power. 3. Use many machine learning tools on original and pattern data: ANN, SVM, kNN, weighted voting, classification trees. 4. Validate the results on data from a different lab.

8 8 Patterns: Observed dataset; System response

9 9 Pattern basics: Positive patterns; Negative patterns

10 10 Individual classifiers used: SVM, ANN, WV, KNN, CART, LR. Trained / calibrated (leave-one-out) on: raw data; pattern data.

11 11 Application: Progression of Follicular Lymphoma (FL) to Diffuse Large B Cell Lymphoma (DLBCL). Gene array data from different laboratories: Shipp et al. (2002) Nature Med. 8(1), 68-74 (Whitehead Lab); Stolovitzky G. (2005) In Deisboeck et al., Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf) (Dalla-Favera Lab); Alexe et al. (2005) Artificial Intelligence in Medicine (in press).

12 12 Non-Hodgkin lymphomas. FL: low-grade non-Hodgkin lymphoma; t(14;18) translocation: over-expression of anti-apoptotic bcl2; 25-60% of FL cases evolve to DLBCL. DLBCL: high-grade non-Hodgkin lymphoma; < 2 years survival if untreated. Biomarkers of FL transformation to DLBCL: p53/MDM2 (Moller et al., 1999), p16 (Pinyol, 1998), p38MAPK (Elenitoba-Johnson et al., 2003), c-myc (Lossos et al., 2002).

13 13 Lymphoma datasets. Data: WI (Shipp et al., 2002), Affy HuGeneFL; CU (Dalla-Favera Lab, Stolovitzky, 2005), Affy Hu95Av2. Samples: WI: 58 DLBCL & 19 FL; CU: 14 DLBCL & 7 FL. Genes: WI: 6817; CU: 12581.

14 14 Data preprocessing: 50% P calls, UL = 16000, LL = 20. Stratify WI data 2/1 into train/test; CU data is used for testing only. Compute SD per gene across samples. Normalize data to mean 0, SD 1 per gene. Generate 500 data sets using noise + k-fold stratified sampling + jackknife. Find genes with high correlation to phenotype using t-test or SNR. Keep genes that are in > 450/501 of datasets.
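
The per-gene normalization and gene-ranking steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the signal-to-noise ratio takes the common form (mean1 - mean2) / (SD1 + SD2), and the gene names and expression values are made up for the example.

```python
from statistics import mean, stdev

def standardize(values):
    """Scale one gene's expression values to mean 0 and SD 1 across samples."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def snr(values, labels):
    """Signal-to-noise ratio of one gene between class 1 and class 0.

    Assumed form: (mean_1 - mean_0) / (sd_1 + sd_0).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

# Hypothetical toy data: one list of sample values per gene.
labels = [1, 1, 1, 0, 0, 0]
genes = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],  # differs strongly by class
    "gene_b": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],    # same mean in both classes
}

# Score every gene on standardized data, then rank by absolute SNR.
scores = {g: snr(standardize(v), labels) for g, v in genes.items()}
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

In the workflow described on the slide, a ranking like this would be repeated on each resampled dataset, and only genes selected in more than 450 of the 501 datasets would be kept.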

15 15 Choosing support sets: Create good patterns using small subsets of genes; validate using weighted voting with 10-fold cross-validation. Sort genes by their appearance in good patterns. Select top genes to cover each sample by at least 10 patterns.
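
One possible reading of the selection rule on this slide is sketched below. It is an interpretation, not the authors' procedure: genes are ranked by how often they appear in patterns, then added in that order until every sample is covered by at least `min_cover` patterns built only from selected genes. The function name `choose_support_set`, the data layout, and the toy patterns are all hypothetical; the toy run uses `min_cover=2` where the slide uses 10.

```python
from collections import Counter

def choose_support_set(patterns, samples, min_cover):
    """Pick genes in order of pattern frequency until coverage is reached.

    patterns: list of (genes, covered_samples) pairs, both given as sets.
    Returns the selected gene set (all genes if coverage is never reached).
    """
    freq = Counter(g for genes, _ in patterns for g in genes)
    # Most frequent genes first; ties broken alphabetically for determinism.
    ranked = [g for g, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    selected = set()
    for gene in ranked:
        selected.add(gene)
        # A pattern is usable only if all of its genes have been selected.
        usable = [cov for genes, cov in patterns if genes <= selected]
        if all(sum(s in cov for cov in usable) >= min_cover for s in samples):
            break
    return selected

# Hypothetical toy patterns over three samples.
patterns = [
    ({"g1"}, {"s1", "s2"}),
    ({"g1", "g2"}, {"s2", "s3"}),
    ({"g3"}, {"s3"}),
    ({"g2"}, {"s1", "s3"}),
]
support = choose_support_set(patterns, ["s1", "s2", "s3"], min_cover=2)
```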

16 16 The 30 genes that best distinguish FL from DLBCL

17 17 Examples of FL and DLBCL patterns. WI training data: each DLBCL case satisfies at least one of the patterns P1 and P2; each FL case satisfies the pattern N1 (and none of the patterns P1 and P2).
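
The transcript does not preserve the pattern definitions themselves. As a rough sketch, assume a pattern is a conjunction of threshold conditions on gene expression levels; the gene names, cutoffs, and samples below are hypothetical and only illustrate how "satisfies P1 or P2" and "satisfies N1" could be checked.

```python
def satisfies(sample, pattern):
    """True if the sample meets every (gene, op, cutoff) condition in the pattern."""
    return all(
        (sample[gene] > cut) if op == ">" else (sample[gene] <= cut)
        for gene, op, cut in pattern
    )

# Hypothetical patterns: P1 and P2 are positive (DLBCL), N1 is negative (FL).
PATTERNS = {
    "P1": [("gene_a", ">", 1.0), ("gene_b", "<=", 0.0)],
    "P2": [("gene_c", ">", 0.5)],
    "N1": [("gene_a", "<=", 1.0), ("gene_c", "<=", 0.5)],
}

def triggered(sample):
    """Names of all patterns that the sample satisfies."""
    return {name for name, p in PATTERNS.items() if satisfies(sample, p)}

dlbcl_like = {"gene_a": 0.2, "gene_b": 0.4, "gene_c": 0.9}
fl_like = {"gene_a": 0.3, "gene_b": -0.2, "gene_c": 0.1}
dlbcl_hits = triggered(dlbcl_like)
fl_hits = triggered(fl_like)
```

Recording which patterns each sample triggers, as `triggered` does, is one way to build the binary "pattern data" representation that the classifiers are then trained on.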

18 18 Pattern data

19 19 Meta-classifier performance

20 20 Error distribution: raw and pattern data

21 21 Biology Based Methods

22 22 p53-related genes identified by filtering procedure: FL → DLBCL progression

23 23 p53 pattern data

24 24 Examples of p53-responsive gene patterns. WI data: each DLBCL case satisfies one of the patterns P1, P2, P3; each FL case satisfies one of the patterns N1, N2, N3.

25 25 p53 combinatorial biomarker. At most one gene over-expressed: 77% of FL & 21% of DLBCL cases (3.7 fold). At least two genes over-expressed: 79% of DLBCL & 23% of FL cases (3.4 fold). Each individual gene: over-expressed in about 40-70% of DLBCL & 20-40% of FL cases (specificity 50-60%, sensitivity 60-70%).
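
The "at least two genes over-expressed" rule can be written as a simple count. The sketch below assumes each gene has its own over-expression cutoff; the gene names, cutoffs, and sample values are placeholders, not values from the talk.

```python
def overexpressed_count(sample, cutoffs):
    """Number of panel genes whose expression exceeds its cutoff."""
    return sum(sample[gene] > cut for gene, cut in cutoffs.items())

def combinatorial_call(sample, cutoffs, min_genes=2):
    """Call DLBCL when at least `min_genes` panel genes are over-expressed, else FL."""
    return "DLBCL" if overexpressed_count(sample, cutoffs) >= min_genes else "FL"

# Hypothetical cutoffs and samples.
cutoffs = {"gene_1": 1.0, "gene_2": 0.5, "gene_3": 2.0}
sample_a = {"gene_1": 1.4, "gene_2": 0.9, "gene_3": 0.3}  # two genes above cutoff
sample_b = {"gene_1": 0.2, "gene_2": 0.7, "gene_3": 0.1}  # one gene above cutoff
calls = [combinatorial_call(s, cutoffs) for s in (sample_a, sample_b)]
```

The point of combining genes this way is that the joint rule separates the classes better than any single gene, each of which has only 50-60% specificity and 60-70% sensitivity according to the slide.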

26 26 What are these genes? Plk1 (stpk13): polo-like kinase, serine/threonine protein kinase 13; M-phase specific; cell transformation, neoplastic; drives quiescent cells into mitosis; over-expressed in various human tumors. Takai et al., Oncogene, 2005: plk1 is a potential target for cancer therapy and a new prognostic marker for cancer. Mito et al., Leuk Lymph, 2005: plk1 is a biomarker for DLBCL. Cdk2 (p33): cyclin-dependent kinase; G2/M transition of mitotic cell cycle; interacts with cyclins A, B3, D, E. P53: tumor suppressor gene (Levine 1982).

27 27 Conclusions: Pattern-based meta-classifier is robust against noise. Good prediction of FL → DLBCL. Biology-based analysis is also possible and yields a useful biomarker. Should study biologically motivated sets of genes → build pathways.

28 28 Thank you for your attention!

29 29 Artificial neural networks

30 30 Support vector machines Find a maximum margin hyperplane in pattern space (Vapnik) (P)(D)

31 31 k-Nearest neighbors. Training data: samples in normalized peptide space. Prediction for test data: the dominant class of the k nearest neighbors in Euclidean metric. [Figure legend: Positive; Negative; New case: Negative]
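
A minimal sketch of the k-nearest-neighbor rule described on this slide, using Euclidean distance and a majority vote. The two-dimensional points and the choice k = 3 are illustrative assumptions only.

```python
from collections import Counter
from math import dist

def knn_predict(train, labels, x, k=3):
    """Return the dominant class among the k training points closest to x."""
    # Indices of the k nearest training samples in Euclidean distance.
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical normalized training samples in a two-feature space.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (-1.0, -1.0), (-1.1, -0.9)]
labels = ["positive", "positive", "positive", "negative", "negative"]

new_case = (-0.8, -1.2)
prediction = knn_predict(train, labels, new_case, k=3)
```

Here the two closest neighbors are negative and the third is positive, so the new case is called negative by majority vote.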

32 32 Weighted voting. Pattern data: each pattern P is a voter; weight = fraction of cases correctly classified by the pattern; for each test case, compute the sum of weights of triggered positive patterns and of triggered negative patterns; classify by highest weight.
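
The voting scheme on this slide can be sketched as below. Two points are assumptions: a pattern "classifies a case correctly" when it fires on a case of its own class or stays silent on a case of the other class, and ties go to the negative class. The patterns are plain Python predicates invented for the example.

```python
def pattern_weight(fires, is_positive, train, labels):
    """Fraction of training cases on which the pattern's vote is correct.

    Assumed meaning of 'correct': the pattern fires exactly on its own class.
    """
    target = 1 if is_positive else 0
    correct = sum(fires(x) == (y == target) for x, y in zip(train, labels))
    return correct / len(train)

def weighted_vote(x, patterns, train, labels):
    """Sum weights of triggered positive and negative patterns; larger sum wins."""
    pos = neg = 0.0
    for fires, is_positive in patterns:
        if fires(x):
            w = pattern_weight(fires, is_positive, train, labels)
            if is_positive:
                pos += w
            else:
                neg += w
    return 1 if pos > neg else 0  # ties are assigned to the negative class

# Hypothetical training cases (one gene) and two patterns.
train = [{"g": 5.0}, {"g": 4.0}, {"g": 1.0}, {"g": 0.0}]
labels = [1, 1, 0, 0]
patterns = [
    (lambda x: x["g"] > 3.0, True),    # positive pattern
    (lambda x: x["g"] <= 2.0, False),  # negative pattern
]
high = weighted_vote({"g": 6.0}, patterns, train, labels)
low = weighted_vote({"g": 1.0}, patterns, train, labels)
```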

33 33 Logistic regression. Dataset of two phenotypes (e.g., cancer vs. non-cancer). Transform into logit space: y -> ln(p/(1-p)). Find phenotype predictor as a linear combination of data values in logit space. Software: Insightful Miner.
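
The logit transform and its inverse are shown below, together with a linear predictor in logit space. The weights and bias are hypothetical placeholders; in practice they would be fitted to training data, which this sketch does not do.

```python
from math import exp, log

def logit(p):
    """Map a probability p in (0, 1) to logit space: ln(p / (1 - p))."""
    return log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: map a real number back to a probability."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(x, weights, bias):
    """Probability of the positive phenotype from a linear score in logit space."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical, unfitted coefficients for a two-gene predictor.
weights, bias = [1.5, -0.5], 0.0
p = predict_probability([2.0, 1.0], weights, bias)  # linear score z = 2.5
```

Applying `logit` to the predicted probability recovers the linear score, which is what "a linear combination of data values in logit space" means.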

34 34 Decision trees / forests. Find rules in training data: find root feature which best classifies samples by phenotype; iterate on each branch to find two new features which best split each branch by phenotype; if necessary, prune weak-support nodes. CART = Classification and Regression Trees (Breiman). Many trees = forest.
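
The first step, finding the root feature that best separates the phenotypes, can be sketched as a search over features and thresholds. The sketch assumes Gini impurity as the split criterion (the criterion commonly used in CART) and binary labels 0/1; it covers only the root split, not the recursion or pruning, and the data are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty or pure list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best root split."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 7.0], [2.0, 3.0], [8.0, 6.0], [9.0, 2.0]]
y = [0, 0, 1, 1]
root = best_split(X, y)
```

A full tree would apply `best_split` again to the samples on each side of the root threshold, and a forest would repeat the whole procedure on resampled data.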

