Statistical Classification for Gene Analysis Based on Micro-array Data
Fan Li & Yiming Yang, in collaboration with Judith Klein-Seetharaman

Presentation transcript:


Principles of cDNA microarray (G. Gibson et al.)
[Diagram labels: DNA clones; PCR purification; robot printing; reference and treated samples; reverse transcription; label with fluorescent dyes; hybridize target to microarray; excitation (laser 1, laser 2); emission; computer analysis]

Microarray data: what does it look like?
Expression matrix: rows are genes (G1, G2, ..., GN-1, GN); columns are experiments (Exp 1, Exp 2, Exp 3, ..., Exp i, ..., Exp M).
A row gives the expression level of a gene across treatments.
A column gives the expression profiles of genes in a certain condition.
Typical examples of columns:
  Conditions: heat shock, G phase in cell cycle, etc.
  Samples: liver cancer patient, normal person, etc.
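A minimal sketch of the expression matrix described above, assuming genes in rows and experiments in columns. The gene names, experiment names, and values are hypothetical and only illustrate how rows and columns are read.

```python
import numpy as np

# Hypothetical toy expression matrix: 4 genes (rows) x 3 experiments (columns).
genes = ["G1", "G2", "G3", "G4"]
experiments = ["Exp1", "Exp2", "Exp3"]
X = np.array([
    [2.1, 0.3, 1.8],
    [0.2, 0.1, 0.4],
    [5.0, 4.7, 5.2],
    [1.0, 3.0, 2.0],
])

# A row is the expression level of one gene across all experiments.
profile_of_g1 = X[genes.index("G1"), :]

# A column is the expression profile of all genes in one condition or sample.
profile_of_exp2 = X[:, experiments.index("Exp2")]

print(profile_of_g1)
print(profile_of_exp2)
```

Note that many classifier implementations expect the transpose of this layout (samples in rows, genes in columns), so `X.T` is often what gets passed on.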

AML/ALL micro-array dataset
This dataset is available for download.
Matrix: each row is a gene; each column is a patient (a sample).
Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia).
The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs).
The whole dataset covers 7129 probes from 6817 human genes.

Published work on AML/ALL
Classification task: gene expression -> {AML, ALL}
Techniques: support vector machines (SVM), Rocchio-style and logistic regression classifiers
Main findings: classifiers can perform better when using a small subset (8) of genes instead of thousands
Implication: many genes are irrelevant or redundant?
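As an illustration of the classification task, here is a sketch of a nearest-centroid classifier, which is one common reading of "Rocchio-style". It is not the implementation used in the published work. The function names and the toy two-gene data are hypothetical, and samples are in rows here (the transpose of the gene-by-patient matrix).

```python
import numpy as np

def fit_centroids(X, y):
    """Return the class labels and one mean expression vector per class.

    X has shape (n_samples, n_genes); y holds one class label per sample.
    """
    labels = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids

def predict(X, labels, centroids):
    """Assign each sample to the class whose centroid is nearest."""
    # Squared Euclidean distance from every sample to every centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[d.argmin(axis=1)]

# Hypothetical training data: 4 patients, 2 genes.
X_train = np.array([[1.0, 0.0], [1.2, 0.1], [0.0, 1.0], [0.1, 1.3]])
y_train = np.array(["ALL", "ALL", "AML", "AML"])
X_test = np.array([[0.9, 0.2], [0.2, 0.9]])

labels, centroids = fit_centroids(X_train, y_train)
pred = predict(X_test, labels, centroids)
print(pred)
```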

Possible relationship (hypothesis)
[Diagram relating a "disease" node and Gene1 through Gene8]

How can we find such a structure?
Find the most informative genes (“primary” ones)
  Statistical feature selection (brief)
Find the genes related (or “similar”) to the primary ones
  Unsupervised clustering (detailed), based on statistical patterns of genes distributed over microarrays
  Bayes network for causal reasoning (future direction)

Possible relationship (hypothesis)
[Diagram relating a "disease" node and Gene1 through Gene8]

Feature selection
Choose a small subset of input variables (a few genes instead of thousands, for example)
In text categorization
  Features = words in documents
  Output variables = subject categories of a document
In protein classification
  Features = amino acid motifs ...
  Output variables = protein categories
In genome micro-array data
  Features = “useful” genes
  Output variables = whether a patient is diseased or not

Feature selection on micro-array (AML vs ALL)
Golub-Slonim: GS-ranking (filtering method)
Ben-Dor: TNoM-ranking (filtering method)
Isabelle Guyon: recursive SVM (wrapper method)
  Selected 8 genes (out of those in that dataset)
  Accuracy 100%
Our work (Fan & Yiming) (best)
  Selected 3 genes (using ridge regression)
  Accuracy 100%
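To make the idea of a filtering method concrete, the following sketch ranks genes by a signal-to-noise score, (mean in class 0 minus mean in class 1) divided by the sum of the two class standard deviations, which is the usual form of the Golub-style score. The function name, the synthetic data, and the planted gene index 7 are assumptions for illustration only; samples are in rows.

```python
import numpy as np

def gs_score(X, y):
    """Signal-to-noise score per gene.

    X has shape (n_samples, n_genes); y holds 0/1 class labels.
    """
    a, b = X[y == 0], X[y == 1]
    spread = a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1)
    return (a.mean(axis=0) - b.mean(axis=0)) / spread

# Synthetic data: 20 samples, 50 genes, only noise by default.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))
X[y == 1, 7] += 5.0  # plant a strong class signal in hypothetical gene 7

scores = gs_score(X, y)
# Genes with the largest absolute score come first.
ranking = np.argsort(-np.abs(scores))
print(ranking[:5])
```

A filter like this scores each gene independently, so two highly correlated genes can both rank near the top even though the second adds little information.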

Feature selection experiments already done on this micro-array data
The 3 genes we found:
  Id1882: CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at
  Id6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
  Id4211: VIL2 Villin 2 (ezrin), X51521_at
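The slides state that ridge regression was used for gene selection but do not give the procedure. One plausible reading, sketched here under that assumption, is to fit a ridge model and keep the genes with the largest absolute coefficients. The closed form w = (X^T X + lambda I)^(-1) X^T y is standard; the function name, the penalty value, and the synthetic data with planted genes 3 and 11 are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression weights on centered data.

    X has shape (n_samples, n_genes); y holds numeric targets (+1 / -1).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_genes = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_genes), Xc.T @ yc)

# Synthetic data: the label depends only on hypothetical genes 3 and 11.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.where(X[:, 3] - X[:, 11] > 0, 1.0, -1.0)

w = ridge_weights(X, y, lam=1.0)
# Keep the two genes with the largest absolute weight.
top2 = np.argsort(-np.abs(w))[:2]
print(sorted(top2.tolist()))
```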

Some analysis of the result
The first two genes are strongly correlated with each other.
The third gene is very different from the first two genes.
1st gene + 2nd gene is bad (10/34 errors)
1st gene + 3rd gene is good (1/34 errors)
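"Strongly correlated" can be checked with a Pearson correlation matrix over gene expression profiles. The sketch below uses synthetic profiles, not the AML/ALL data: `g2` is built as a noisy copy of `g1`, and `g3` is independent, to mimic the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 38  # same size as the training set, values are synthetic

g1 = rng.normal(size=n_samples)
g2 = g1 + 0.1 * rng.normal(size=n_samples)  # nearly redundant with g1
g3 = rng.normal(size=n_samples)             # unrelated to g1

# np.corrcoef treats each row as one variable (here: one gene).
corr = np.corrcoef(np.vstack([g1, g2, g3]))
print(np.round(corr, 2))
```

A pair such as `g1` and `g2` carries almost the same information twice, which is consistent with the observation that pairing the first gene with the third works better than pairing it with the second.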

Question: as the next step, can we find more gene-gene relationships?
Several techniques are available:
  Clustering
  Bayesian network learning
  Independent component analysis
  ...

Clustering analysis in micro-array data
Clustering methods have already been widely used to find similar genes or common binding sites from micro-array data.
Many different clustering algorithms exist:
  Hierarchical clustering
  K-means
  SOM
  CAST
  ...

An example of hierarchical clustering analysis (from Spellman et al.)

Our clustering experiment on AML/ALL dataset
Our clustering is run over the 1000 genes most relevant to the disease.
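A small sketch of k-means clustering applied to gene expression profiles (one row per gene). The slides do not specify the settings used in the experiment, so everything here is an assumption: the farthest-first initialization is chosen only to keep the example deterministic, and the ten synthetic genes follow two hypothetical expression patterns.

```python
import numpy as np

def init_farthest_first(X, k):
    """Pick the first row, then repeatedly the row farthest from chosen centers."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations; returns cluster labels and cluster centers."""
    centers = init_farthest_first(X, k)
    for _ in range(n_iter):
        # Assign every gene to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Ten synthetic genes measured in four samples, following two patterns.
rng = np.random.default_rng(1)
pattern_a = np.array([2.0, 2.0, 0.0, 0.0])
pattern_b = np.array([0.0, 0.0, 2.0, 2.0])
profiles = np.vstack([
    pattern_a + 0.1 * rng.normal(size=(5, 4)),
    pattern_b + 0.1 * rng.normal(size=(5, 4)),
])

labels, centers = kmeans(profiles, k=2)
print(labels)
```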

The feature-selection curve

Our clustering result in the top 1000 genes

Some analysis of the clustering result
The first two genes are always placed in the same cluster (in hierarchical clustering, they are in cluster 1; in k-means clustering, they are in cluster 2).
The third gene is never placed in the same cluster as the first two genes (in hierarchical clustering, it is in cluster 23; in k-means clustering, it is in cluster 1).
This validates our previous analysis.

Disadvantages of clustering
However...
  It cannot reveal the internal relationships inside one cluster.
  It cannot find the relationships between clusters.
  Genes connected to each other may not be in the same cluster.
Clustering vs Bayesian network learning (copied from David K. Gifford, Science, vol. 293, Sept. 2001)

A counterexample to clustering analysis

Bayesian network learning
A Bayesian network therefore seems a much better technique if we want to model the relationships among genes.
Researchers have run experiments and constructed Bayesian networks from micro-array data.
They found that a few genes have many connections with other genes.
They used prior biological knowledge to validate the learned edges (interactions between genes) and found them reasonable.
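For reference, a Bayesian network over expression variables represents the joint distribution as a product of one conditional distribution per gene, given that gene's parents in a directed acyclic graph. The symbols below are notation assumed here, not taken from the slides: X_i is the expression level of gene i and Pa(X_i) is its parent set.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each edge into X_i corresponds to one member of Pa(X_i), which is why a learned network can express relationships inside and between groups of genes that a cluster assignment alone does not.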

An example of a Bayesian network
Part of the Bayesian network Nir Friedman constructed.
There are 800 genes (nodes) in total in the graph.
These 800 genes are all cell-cycle regulated genes.

Our plan in genetic regulatory network construction
There are several possible ways:
  Use feature selection techniques to make the network learning task more robust and less computationally costly.
  Learn gene regulatory networks on microarray datasets with disease labels (thus we may find pathways relevant to a specific disease).
  Use ICA to find hidden variables (hidden layers) and check their consistency with the Bayes network learning result.

Our plan in genetic regulatory network construction
Use prior biological knowledge about gene networks, such as “network motifs”.
The following example is copied from Shai S. Shen-Orr, Nature Genetics.
Previous network learning algorithms have not considered these characteristics.

References
Using Bayesian networks to analyze expression data. Friedman, N., Linial, M., Nachman, I., Journal of Computational Biology, 7.
Gene selection for cancer classification using support vector machines. Guyon, I. et al., Machine Learning, 46.
Cluster analysis and display of genome-wide expression patterns. Eisen, M.B. et al., PNAS, 95, 1998.
Clustering gene expression patterns. Ben-Dor, A., Shamir, R., and Yakhini, Z., Journal of Computational Biology, 6(3/4), 1999.