Copyright © 2003 Limsoon Wong. Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases. Limsoon Wong, Institute for Infocomm Research.

Presentation transcript:

Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases
Limsoon Wong, Institute for Infocomm Research

Plan
Some accomplishments and challenges in knowledge discovery from biological and clinical data
Data mining in microarray analysis
– diagnosis of disease state and subtype
– derivation of treatment plan
– understanding of gene interaction network

Knowledge Discovery from Biological and Clinical Data: MOTIVATION

Driving Forces: Genes, Proteins, Interactions, Diagnosis, & Cures
Complete genomes are now available
Knowing the genes is not enough to understand how biology functions
Proteins, not genes, are responsible for many cellular activities
Proteins function by interacting with other proteins and biomolecules
GENOME, PROTEOME, INTERACTOME

If we figure out how these work, we get these benefits
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

To figure these out, we bet on...
“solution” = Data Management + Knowledge Discovery
Data Management = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

Knowledge Discovery from Biological and Clinical Data: ACCOMPLISHMENT

8 years of bioinformatics R&D in Singapore
Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics
[Figure labels: ISS, KRDL, LIT/I²R; GeneticXchange, Molecular Connections, Biobase]

Predict Epitopes, Find Vaccine Targets
Vaccines are often the only solution for viral diseases
Finding & developing effective vaccine targets is a slow and expensive process
Develop systems to recognize protein peptides that bind MHC molecules
Develop systems to recognize hot spots in viral antigens

Recognize Functional Sites, Help Scientists
Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
Data mining of biological sequences to find rules for recognizing & understanding functional sites
Dragon’s 10x reduction of false positives in TSS (transcription start site) recognition

Diagnose Leukaemia, Benefit Children
Childhood leukaemia is a heterogeneous disease
Treatment is based on subtype
3 different tests and 4 different experts are needed for accurate diagnosis
⇒ Curable in USA, fatal in Indonesia
A single platform diagnosis based on gene expression
Data mining to discover rules that are easy for doctors to understand

Understand Proteins, Fight Diseases
Understanding the function and role of a protein needs organised information on interaction pathways
Such information is often reported in scientific papers but is seldom found in structured databases
Knowledge extraction system to
– process free text
– extract protein names
– extract interactions

Data Mining in Microarray Analysis: MICROARRAY BACKGROUND

What’s a Microarray?
Contains a large number of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers
Measures expression of thousands of genes simultaneously

Affymetrix GeneChip Array

Making Affymetrix GeneChip
Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules
Exposed linkers become deprotected and are available for nucleotide coupling

Gene Expression Measurement by GeneChip

A Sample Affymetrix GeneChip File (U95A)

Data Mining in Microarray Analysis: DISEASE SUBTYPE DIAGNOSIS

Pediatric Acute Lymphoblastic Leukemia
A heterogeneous disease with more than 12 subtypes, e.g., T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, and Hyperdip>50
Treatment response is subtype dependent
80% continuous remission if subtype is correctly diagnosed and the corresponding treatment plan is applied

Subtype Diagnosis
Requires different tests:
– immunophenotyping
– cytogenetics
– molecular diagnostics
Requires different experts:
– hematologist
– oncologist
– pathologist
– cytogeneticist

Difficulties and Implications
The different tests and experts are not commonly available within a single hospital, especially in less advanced countries
⇒ An 80%-curable disease in USA can be a fatal disease in Indonesia!
⇒ Is there a single diagnostic platform that does not need multiple human specialists?

A Potential Solution by Microarrays
Yeoh et al., Cancer Cell 1: , 2002
[Figure: diagnostic ALL BM samples (n=327) versus genes for class distinction (n=271); scale from -3σ to 3σ, where σ = std deviation from mean; sample groups: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel]

Some Caveats
Study was performed on Americans
May not be applicable to Singaporeans, Malaysians, Indonesians, etc.
Large-scale study on local populations currently in the works

Typical Procedure in Analysing Gene Expression for Diagnosis
Gene expression data collection
Gene selection
Classifier training
Classifier tuning (optional for some machine learning methods)
Apply classifier for diagnosis of future cases

Feature Selection Methods
A refresher of feature selection methods

Signal Selection (Basic Idea)
Choose a signal with low intra-class distance
Choose a signal with high inter-class distance

Signal Selection (e.g., t-statistics)
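
The formula on this slide did not survive extraction. As a minimal sketch, assuming the common unequal-variance form of the two-sample t-statistic, a gene can be scored as the gap between its class means divided by the combined within-class spread. The function name `t_score` and the toy values are illustrative, not taken from the talk.

```python
from math import sqrt
from statistics import mean, variance

def t_score(class_a, class_b):
    """Absolute two-sample t-statistic (unequal-variance form) for one gene.

    Assumes each class has at least two samples and that the
    expression values are not all identical within both classes.
    """
    n_a, n_b = len(class_a), len(class_b)
    spread = sqrt(variance(class_a) / n_a + variance(class_b) / n_b)
    return abs(mean(class_a) - mean(class_b)) / spread

# Hypothetical expression values of two genes in two classes.
separating = t_score([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1])
overlapping = t_score([1.0, 3.0, 2.0, 2.5], [1.5, 2.5, 2.0, 3.0])
print(separating, overlapping)
```

A gene with a large score has low intra-class distance and high inter-class distance, which is the basic idea stated on the previous slide.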

Signal Selection (e.g., χ²)
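
The slide's formula is missing from the transcript. A minimal sketch, assuming the standard Pearson χ² statistic on a contingency table whose rows are the discretized expression intervals of one gene and whose columns are the classes. The name `chi_square` and the example tables are illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] = number of samples of class j whose expression
    falls in interval i. Assumes no row or column sums to zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical gene whose two intervals split the classes perfectly,
# and another whose intervals carry no class information.
perfect = chi_square([[10, 0], [0, 10]])
useless = chi_square([[5, 5], [5, 5]])
print(perfect, useless)
```

Genes with a larger statistic deviate more from what class-independent expression would predict, so they rank higher as signals.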

Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
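
The slide states the CFS principle without a formula. As a hedged sketch, the merit heuristic usually associated with CFS rewards high signal-to-class correlation and penalizes signal-to-signal correlation. The function `cfs_merit` and its inputs (precomputed correlation magnitudes) are assumptions made for illustration; how the correlations are measured is left open here.

```python
from math import sqrt

def cfs_merit(class_corr, mean_feature_corr):
    """Merit of a group of k signals.

    class_corr: correlation magnitude of each signal with the class.
    mean_feature_corr: average correlation magnitude between
    distinct signals in the group (assumed precomputed).
    """
    k = len(class_corr)
    mean_class_corr = sum(class_corr) / k
    return k * mean_class_corr / sqrt(k + k * (k - 1) * mean_feature_corr)

# Two hypothetical pairs of signals, equally predictive individually.
independent_pair = cfs_merit([0.8, 0.8], 0.0)  # uncorrelated with each other
redundant_pair = cfs_merit([0.8, 0.8], 1.0)    # perfectly redundant
print(independent_pair, redundant_pair)
```

The redundant pair scores no better than a single signal of the same quality, while the independent pair scores higher, matching the description of a good group.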

Gene Expression Profile Classification
An introduction to gene expression profile classification by the example on ALL subtype diagnosis

Subtype Classification of ALL
A tree-structured diagnostic workflow was recommended by the doctors, as per Yeoh et al., Cancer Cell 1: , 2002

Training and Testing Sets

Our procedure for ALL subtype diagnosis
Gene expression data collection
Gene selection by entropy
Classifier training by emerging pattern
Classifier tuning (optional for some machine learning methods)
Apply classifier for diagnosis of future cases by PCL

Signal Selection (e.g., entropy)
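
The details of this slide are missing from the transcript. A minimal sketch of the idea, assuming entropy-based selection means looking for a cut point on a gene's expression range that leaves each side as pure in class labels as possible. The names `entropy` and `best_cut` are illustrative, and the sketch omits any stopping criterion that decides whether a gene has an acceptable cut point at all.

```python
from math import log2

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )

def best_cut(values, labels):
    """Return (weighted entropy, cut point) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal expression values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

# Hypothetical gene that separates normal from tumour samples cleanly.
score, cut = best_cut([1, 2, 3, 10, 11, 12], ["n", "n", "n", "t", "t", "t"])
print(score, cut)
```

Genes whose best split has the lowest entropy are the ones kept, which is how "20 genes of lowest entropy" can be read later in the talk.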

Emerging Patterns (EPs)
An EP is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
A jumping EP is an EP that
– some members of a class satisfy
– but no members of the other class satisfy
We use only most general jumping EPs
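
The definitions above can be sketched as follows, assuming each sample has already been turned into a set of satisfied conditions (for example "gene 1 is below its cut point"). The item names and the helper names `frequency` and `is_jumping_ep` are hypothetical.

```python
def frequency(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    return sum(pattern <= sample for sample in samples) / len(samples)

def is_jumping_ep(pattern, home, other):
    """True if some home-class samples satisfy the pattern and no
    sample of the other class does."""
    return frequency(pattern, home) > 0 and frequency(pattern, other) == 0

# Hypothetical discretized samples.
normal = [
    {"g1_low", "g2_high"},
    {"g1_low", "g2_high", "g3_low"},
    {"g1_low"},
]
tumour = [
    {"g1_high", "g2_high"},
    {"g1_low", "g2_low"},
]

pair = {"g1_low", "g2_high"}
single = {"g1_low"}
print(is_jumping_ep(pair, normal, tumour))    # pair occurs only in normals
print(is_jumping_ep(single, normal, tumour))  # single also occurs in a tumour
```

In this toy data the two-condition pattern is a jumping EP of the normal class while its one-condition subset is not, so the pair is also most general in the sense that no proper subset of it is a jumping EP.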

PCL: Prediction by Collective Likelihood
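
The scoring formula on this slide is not in the transcript. The sketch below assumes the rule given in published descriptions of PCL: for each class, the k most frequent EPs of that class that the test sample contains are compared, rank by rank, with the k most frequent EPs of the class overall, and the frequency ratios are summed. The function name `pcl_score`, the data layout, and the toy EPs are assumptions for illustration only.

```python
def pcl_score(sample, class_eps, k):
    """Collective-likelihood score of a sample for one class.

    class_eps: list of (pattern, frequency) pairs for the class,
    sorted by descending frequency (assumed precomputed).
    """
    contained = [(p, f) for p, f in class_eps if p <= sample][:k]
    top = class_eps[:k]
    return sum(cf / tf for (_, cf), (_, tf) in zip(contained, top))

# Hypothetical EPs of two classes and one test sample.
normal_eps = [({"a"}, 0.9), ({"b"}, 0.8), ({"c"}, 0.5)]
tumour_eps = [({"x"}, 0.9), ({"c", "y"}, 0.6)]
test_sample = {"a", "c"}

normal_score = pcl_score(test_sample, normal_eps, k=2)
tumour_score = pcl_score(test_sample, tumour_eps, k=2)
prediction = "normal" if normal_score > tumour_score else "tumour"
print(normal_score, tumour_score, prediction)
```

The class with the higher score is predicted. A sample that contains the very top EPs of a class scores close to k for that class.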

Accuracy (using 20 genes of lowest entropy)
[Table, layout lost in extraction] PCL 1:1 0:2 5 1:1 0:0 4 0:1 1:1 1:6 14 0:1 1:1 5 0:1 2:2 7

Comprehensibility

Gene Expression Profile Classification
How about other feature selection and classification methods?

Some gene selection heuristics
all-CFS: all features from CFS
top20-χ²: 20 features with highest χ² stats
top20-t: 20 features with highest t-stats
top20-mit: 20 features with highest MIT stats
entropy: 20 features with lowest entropy
all-χ²: all features meeting 5% significance level of χ² stats

Some other classification methods
k-NN (k=1)
– majority votes of the k nearest neighbours determined by Euclidean distance
C4.5
– widely used decision tree method
Naïve Bayes (NB)
– probabilistic prediction using Bayes’ rule
SVM
– (linear) discriminant function that maximizes separation of boundary samples
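
As an illustration of the first method in the list, here is a minimal k-NN sketch using Euclidean distance and majority voting. The function name `knn_predict`, the two-gene profiles, and the labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(query, training, k=1):
    """Label a query profile by majority vote of its k nearest
    training profiles under Euclidean distance.

    training: list of (expression_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda example: dist(query, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression profiles.
training = [
    ((1.0, 1.0), "T-ALL"),
    ((1.2, 0.9), "T-ALL"),
    ((5.0, 5.0), "MLL"),
    ((5.1, 4.8), "MLL"),
]
print(knn_predict((4.9, 5.2), training))       # k=1, as on the slide
print(knn_predict((0.8, 1.1), training, k=3))
```

With k=1 the prediction is simply the label of the single closest training sample.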

Accuracy
Feature selection improves performance
Entropy+PCL has consistent high performance

When 20 genes are selected randomly
Average over 100 experiments
Cf. mistakes total with good feature selection

Data Mining in Microarray Analysis: TREATMENT PLAN DERIVATION
A pure speculation!

Can we do more with EPs?
Detect gene groups that are significantly related to a disease
Derive coordinated gene expression patterns from these groups
Derive “treatment plan” based on these patterns

Colon Tumour Dataset
Alon et al., PNAS 96: , 1999
We use the colon tumour dataset above to illustrate our ideas
– 22 normal samples
– 40 colon tumour samples

Detect Gene Groups
Feature Selection
– Use entropy method
– 35 genes have cut points
Generate EPs
– 19501 EPs in normals
– 2165 EPs in tumours
EPs with largest support are gene groups significantly correlated with disease

Top 20 EPs

Observation 1
Some EPs contain a large number of genes and still have high frequency
E.g., {2, 3, 6, 7, 13, 17, 33} has frequency 90.91% in normal and 0% in cancer samples
⇒ Nearly all normal samples’ gene expression values satisfy all conditions implied by these 7 items

Observation 2
Frequency of a singleton EP is not necessarily larger than that of an EP having multiple genes
E.g., {5} is an EP in cancer samples and has frequency 32.5%
E.g., {16, 58, 62} is an EP in cancer samples and has frequency 75.5%
⇒ Groups of genes and their correlations could be more important than single genes

Observation 3
M33680 has lowest entropy of the 35 genes if cutpoint is set at
/40 of cancer samples shift expression level of M33680 from its normal range to its abnormal range

Treatment Plan Idea
Increase/decrease expression level of particular genes in a cancer cell so that
– it has the common EPs of normal cells
– it has no common EPs of cancer cells

Treatment Plan Example
From the EP {2,3,6,7,13,17,33}
– 91% of normal cells express the 7 genes (T51560, T49941, M62994, R34701, L02426, U20428, R10707) in the corresponding intervals
– a cancer cell never expresses all 7 genes in the same way
– if expression level of improperly expressed genes can be adjusted, the cancer cell can have one common EP of normal cells
– a cancer cell can then be iteratively converted into a normal one

Choosing Genes to Adjust

Doing more adjustments...
Down regulating T49941 leads to 2 more top 10 EPs of normal cells showing up in the adjusted T1
Down regulating X62153 to below 396 and T72403 to below 296 leads to T1 having 9 top 10 EPs of normal cells
Average number of EPs in normal cells is 9
So the adjusted T1 now has important features of normal cells

Next, eliminate common EPs of cancer cells in T1
6 more genes (K03001, T49732, U29171, R76254, D31767, L40992) are adjusted
All top 10 EPs of cancer cells now disappear from T1
Average number of top 10 EPs contained in cancer cells is 6
The adjusted T1 now holds enough common features of normal cells and no features of cancer cells
⇒ T1 is converted to normal cells

“Treatment Plan” Validation
“Adjustments” were made to the 40 colon tumour samples based on EPs as described
Classifiers trained on original samples were applied to the adjusted samples
It works!

A Big But...
Effective means for identifying mechanisms and pathways through which to modulate gene expression of selected genes need to be developed

Data Mining in Microarray Analysis: GENE INTERACTION PREDICTION

Beyond Classification of Gene Expression Profiles
After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?
[Figure: diagnostic ALL BM samples (n=327) versus genes for class distinction (n=271); scale from -3σ to 3σ, where σ = std deviation from mean; sample groups: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel]

Gene Regulatory Circuits
Genes are “connected” in “circuit” or network
Expression of a gene in a network depends on expression of some other genes in the network
Can we reconstruct the gene network from gene expression data?

Key Questions
For each gene in the network:
which genes affect it?
How do they affect it?
– Positively?
– Negatively?
– More complicated ways?

Some Techniques
Bayesian Networks
– Friedman et al., JCB 7: , 2000
Boolean Networks
– Akutsu et al., PSB 2000, pages
Differential equations
– Chen et al., PSB 1999, pages
Classification-based method
– Soinov et al., “Towards reconstruction of gene networks from expression data by supervised learning”, Genome Biology 4:R6.1-9, 2003

A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
Given a gene expression matrix $X$
– each row is a gene
– each column is a sample
– each element $x_{ij}$ is expression of gene $i$ in sample $j$
Find the average value $a_i$ of each gene $i$
Denote $s_{ij}$ as state of gene $i$ in sample $j$,
– $s_{ij} = \text{up}$ if $x_{ij} > a_i$
– $s_{ij} = \text{down}$ if $x_{ij} \le a_i$
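
The state assignment just defined can be sketched directly. The function name `discretize` and the toy matrix are illustrative; rows are genes and columns are samples, as on the slide.

```python
def discretize(matrix):
    """Turn each expression value into 'up' or 'down' relative to
    the average expression of its gene (its row)."""
    states = []
    for row in matrix:
        average = sum(row) / len(row)
        states.append(["up" if x > average else "down" for x in row])
    return states

# Hypothetical expression matrix: 2 genes, 4 samples.
expression = [
    [1.0, 2.0, 3.0, 6.0],  # average 3.0
    [4.0, 4.0, 4.0, 4.0],  # average 4.0, so no value is strictly above it
]
states = discretize(expression)
print(states)
```

Because the threshold is each gene's own average, no externally chosen discretization threshold is needed.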

A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
To see whether the state of gene $g$ is determined by the state of other genes
– we see whether $\langle s_{ij} \mid i \ne g \rangle$ can predict $s_{gj}$
– if it can predict with high accuracy, then “yes”
– Any classifier can be used, such as C4.5, PCL, SVM, etc.
To see how the state of gene $g$ is determined by the state of other genes
– apply C4.5 (or PCL or other “rule-based” classifiers) to predict $s_{gj}$ from $\langle s_{ij} \mid i \ne g \rangle$
– and extract the decision tree or rules used
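
To make the idea concrete without reimplementing C4.5, the sketch below swaps in a much simpler stand-in: a one-gene rule (a decision stump) that predicts the target gene's state from a single other gene, either directly or inverted. This is not the classifier used by Soinov et al.; the function name `best_single_predictor`, the gene names, and the states are hypothetical, and the accuracy is measured on the training samples only.

```python
def best_single_predictor(states, target):
    """Find the other gene whose state best predicts the target's.

    states: dict mapping gene name to a list of 'up'/'down' states,
    one per sample. Returns (gene, accuracy, sign) where sign is
    'positive' if equal states predict best, else 'negative'.
    """
    goal = states[target]
    best = None
    for gene, gene_states in states.items():
        if gene == target:
            continue
        agree = sum(a == b for a, b in zip(gene_states, goal)) / len(goal)
        accuracy, sign = max((agree, "positive"), (1 - agree, "negative"))
        if best is None or accuracy > best[1]:
            best = (gene, accuracy, sign)
    return best

# Hypothetical discretized states over 4 samples.
states = {
    "g": ["up", "up", "down", "down"],
    "a": ["down", "down", "up", "up"],  # always opposite to g
    "b": ["up", "down", "up", "down"],  # unrelated to g
}
print(best_single_predictor(states, "g"))
```

A rule-based classifier such as C4.5 plays the same role with multi-gene rules, and the extracted rule answers both key questions: which genes affect the target and whether they do so positively or negatively.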

Advantages of this method
Can identify genes affecting a target gene
Don’t need discretization thresholds
Each data sample is treated as an example
Explicit rules can be extracted from the classifier (assuming C4.5 or PCL)
Generalizable to time series

Acknowledgements
Vladimir Bajic
See-Kiong Ng
Vladimir Brusic
Huiqing Liu
Jinyan Li

Data Mining in Microarray Analysis: NOTES

References
J. Li, L. Wong, “Geography of differences between two classes of data”, Proc. 6th European Conf. on Principles of Data Mining and Knowledge Discovery, pp , 2002
J. Li, L. Wong, “Identifying good diagnostic genes or gene groups from gene expression data by using the concept of emerging patterns”, Bioinformatics, 18: , 2002
J. Li et al., “A comparative study on feature selection and classification methods using a large set of gene expression profiles”, GIW, 13:51-60, 2002

References
E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1: , 2002
U. Alon et al., “Broad patterns of gene expression revealed by clustering analysis of tumor colon tissues probed by oligonucleotide arrays”, PNAS 96: , 1999
L. A. Soinov et al., “Towards reconstruction of gene networks from expression data by supervised learning”, Genome Biology 4:R6.1-9, 2003

Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases
This talk is divided into two parts. In Part I, I will provide a brief overview of some accomplishments and challenges in bioinformatics. In Part II, I will discuss data mining in the analysis of microarray gene expression profiles for (a) diagnosis of disease state or subtype, (b) derivation of disease treatment plan, and (c) understanding of gene interaction networks.