Examples of Classifying Expression Data
6.892 / 7.90 Computational Functional Genomics, Spring 2002

Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Tamayo, Slonim, Mesirov, Zhu, Kitareewan, Dmitrovsky, Lander, Golub. PNAS 96, March 1999.

Problems with hierarchical clustering
– Not designed to reflect the multiple ways expression patterns can be similar
– Clusters may not be robust or unique
– Points can be clustered based on local decisions that lock in structure

Self-Organizing Maps (SOMs)
Mathematical space for SOMs
– n genes measured in k samples define n points in k-dimensional space
Impose partial structure on the clusters to start
– Choose a geometry of nodes, e.g. a 3 x 2 grid
– Nodes are mapped into the k-dimensional space at random
– Each iteration moves the nodes in the direction of a randomly selected data point
– The closest node is moved the most
– After 20,000 – 50,000 iterations the genes have been clustered

Example SOM iteration

Iterative point moving
f_{i+1}(N) = f_i(N) + L(d(N, N_P), i) * (P - f_i(N))
– P is the observation used in iteration i to update the map points
– N is the map point being updated; N_P is the closest map point to P
– The learning rate L decreases with distance and with iteration i
– T is the total number of iterations
– L(x, i) = 0.02T / (T + 100i) for x <= p(i); L(x, i) = 0 otherwise
– p(i) decreases linearly with i, with p(0) = 3
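A minimal sketch of this update loop is shown below, assuming a small rectangular grid, Manhattan distance between grid nodes, and a neighborhood radius p(i) that shrinks linearly from 3 to 0 (the slides do not say where it ends). The function name, grid shape, and data layout (genes as rows, samples as columns) are illustrative, not the implementation used in the paper.

```python
import numpy as np

def som_iterations(data, grid_shape=(3, 2), total_iters=20000, seed=0):
    """Sketch of the SOM update: nodes live in the same k-dimensional space as the genes."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    k = data.shape[1]
    # Grid coordinates of each node, used for the neighborhood distance d(N, N_P).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Nodes start at random positions in expression space.
    nodes = rng.normal(size=(rows * cols, k))

    for i in range(total_iters):
        p = data[rng.integers(len(data))]                      # randomly selected point P
        nearest = np.argmin(((nodes - p) ** 2).sum(axis=1))    # N_P, the node closest to P
        grid_dist = np.abs(grid - grid[nearest]).sum(axis=1)   # d(N, N_P) on the grid
        radius = 3.0 * (1 - i / total_iters)                   # p(i): assumed to shrink to 0
        rate = 0.02 * total_iters / (total_iters + 100 * i)    # L(x, i) inside the radius
        in_hood = grid_dist <= radius
        # f_{i+1}(N) = f_i(N) + L(d(N, N_P), i) * (P - f_i(N)) for nodes in the neighborhood
        nodes[in_hood] += rate * (p - nodes[in_hood])
    return nodes
```

Because L(x, i) is a step function of the grid distance x, the "closest node is moved the most" behaviour comes from the shrinking neighborhood rather than from a distance-weighted learning rate.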

Data normalization
– Genes were eliminated if they did not change significantly (eliminates attraction to invariant genes)
– Expression levels are normalized to mean 0 and variance 1 (focus on the shape of the profile)
– Yeast data: levels were normalized within each of the two cell cycles
– Human data: expression levels were normalized within the time points
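As a sketch of this preprocessing, assuming raw positive intensities with genes as rows and samples as columns; the fold-change and absolute-change thresholds below are placeholders (the slides only say that genes that did not change significantly were eliminated).

```python
import numpy as np

def filter_and_normalize(expr, min_fold=3.0, min_delta=100.0):
    """Drop near-invariant genes, then standardize each gene to mean 0 and variance 1."""
    lo, hi = expr.min(axis=1), expr.max(axis=1)
    # Variation filter: keep genes whose range is large enough (thresholds are illustrative).
    varies = (hi / np.maximum(lo, 1e-9) >= min_fold) & (hi - lo >= min_delta)
    kept = expr[varies]
    # Normalize each gene's profile so the SOM sees shape rather than absolute level.
    return (kept - kept.mean(axis=1, keepdims=True)) / kept.std(axis=1, keepdims=True)
```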

SOM computation
– Computation time is about 1 minute; 20,000 – 50,000 iterations for 416 to 1,036 genes
– A web-based interface is used to visualize the data
– The average expression pattern of each cluster is displayed with error bars
– Members of a cluster can also be overlaid on a single plot
Yeast cell cycle example
– 6 x 5 SOM
– 416 genes
– Computed in 82 seconds

Cluster 29 detail – 76 members exhibiting periodic behavior in late G1

G1, S, G2, and M phase related clusters (C29, C14, C1, C5)

Centroids for groups of genes identified by visual inspection by Cho et al.

PMA-treated HL-60 cells SOM
– 567 genes passing the variation filter were grouped into a 4 x 3 SOM
– PMA (phorbol 12-myristate 13-acetate) induces macrophage differentiation
– Cluster 11: PMA-induced genes

Hematopoietic differentiation across four cell lines: HL-60, U937, Jurkat, NB4
– n = 17, 1,036 genes, 6 x 4 SOM

SOM conclusions
– Successful at finding new structure
– Inspection is still necessary to find insights
– Able to recover the temporal response to a perturbation
– Can provide a richer topology than a linear ordering
– However, the topology needs to be provided in advance

Plan
– Overview of classification techniques
– Mixture model clustering: Alon et al., colon tumors
– Weighted voting of selected genes: Golub et al., leukemia (ALL, AML)
– Hierarchical clustering: Alizadeh et al., diffuse large B-cell lymphoma

Statistical pattern recognition
– A classifier is an algorithm that assigns an observation to a class
– A class can be a letter (handwriting recognition), a person (face recognition), a type of cell, a diagnosis, or a prognosis
– Data set: data with known classes, used for training
– The goal is to generalize knowledge from the data set to new observations
– Classification is based on features; feature selection is key

Model complexity
– A model describes a data set and is used to make future decisions
– If a model is too simple, it gives a poor fit to the data set
– If a model is too complex, it gives a poor representation of the systematic aspects of the data (it overfits the data set)

Types of classifiers
– Discriminative: no assumptions about the underlying model
– Generative: assumptions are made about the form of the underlying model (e.g. variables are Gaussian); these assumptions give performance advantages, but also disadvantages if they are incorrect

Mixture Models for Clustering
Alon, U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96, June 1999.

Problem definition
– 40 colon adenocarcinoma biopsy specimens
– 22 normal tissue specimens
– Cell lines derived from colon carcinoma (EB and EB-1)
– Can we tell the cancer specimens from the normal specimens by expression analysis?

Gene pair correlations
– Dashed line: correlation for the data set randomized 10^4 times
– Shaded area: P <
– Each gene: 30 genes with significant positive correlation, 10 genes with significant negative correlation
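One reading of the randomization baseline is that each gene's values are shuffled across tissues many times and the pairwise correlations of the shuffled data form the null distribution (the dashed line). A sketch of that comparison, with a deliberately small number of permutations and illustrative names:

```python
import numpy as np

def pairwise_correlation_distributions(expr, n_perm=100, seed=0):
    """Compare real gene-pair correlations to those of data randomized within each gene.
    The slide describes 10^4 randomizations; n_perm is kept small here for illustration."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(expr.shape[0], k=1)          # indices of distinct gene pairs
    real = np.corrcoef(expr)[iu]                      # real pairwise correlations
    null = []
    for _ in range(n_perm):
        shuffled = np.array([rng.permutation(row) for row in expr])
        null.append(np.corrcoef(shuffled)[iu])
    return real, np.concatenate(null)
```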

Mixture model
– Each gene is represented by a vector that has been normalized so that its sum is 0 and its magnitude is 1
– The mixture model used assumes two distributions with centroids C_j
– P_j(V_k) is the probability that vector V_k belongs to class j
– C_j = Σ_k V_k P_j(V_k) / Σ_k P_j(V_k)

The mixture model is used for top-down clustering
– At the end of the iterations, each gene is assigned to the cluster with the highest probability
– This makes a hard boundary between clusters
– The process is repeated on both subclusters
– Both genes and tissues are clustered using the same algorithm
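A sketch of one level of this top-down split follows. The slides give the normalization and the centroid update but not the membership model, so the soft assignment P_j(V_k) proportional to exp(-beta * ||V_k - C_j||^2) is an assumption here, as are the names and the fixed iteration count.

```python
import numpy as np

def two_way_split(vectors, n_iter=50, beta=5.0, seed=0):
    """One level of top-down clustering: soft-assign genes to two centroids, then harden."""
    rng = np.random.default_rng(seed)
    # Normalize each gene vector: sum 0, magnitude 1 (as described on the previous slide).
    v = vectors - vectors.mean(axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    centroids = v[rng.choice(len(v), size=2, replace=False)]   # two initial centroids
    for _ in range(n_iter):
        d2 = ((v[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # squared distances
        p = np.exp(-beta * d2)
        p /= p.sum(axis=1, keepdims=True)                      # P_j(V_k), assumed model
        # C_j = sum_k V_k P_j(V_k) / sum_k P_j(V_k)
        centroids = (p.T @ v) / p.sum(axis=0)[:, None]
    return p.argmax(axis=1)   # hard assignment; recurse on each subcluster to split further
```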

Results of clustering algorithm

Excerpt from ribosomal gene cluster

Expanded view of clustering
– Tumor tissues have arrows at left
– ** marks the EB and EB-1 colon carcinoma cell lines

Five of the 20 most informative genes are muscle genes. The muscle index is the normalized average intensity of 17 muscle-related ESTs.

Sensitivity of clustering to the genes used (genes sorted by t-test)

Conclusion
– Tumors of epithelial origin are distinguished from muscle-rich normal tissue samples
– Tumor cell lines are distinguished
– Tissue purity of the in vivo samples is needed

Weighted Voting for Classification
Golub, T. et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, October 15, 1999.

Two challenges
– Class discovery: defining previously unrecognized tumor subtypes
– Class prediction: assignment of tumor samples to already-defined classes

Data source
– 38 bone marrow samples: 27 acute lymphoblastic leukemia (ALL), 11 acute myeloid leukemia (AML)
– Hybridized to Affymetrix arrays covering 6,817 human genes

Classifier architecture

Pick informative feature set

Correlation function
– All variables are first log-transformed
– g is a vector of expression values across samples, [e_1 .. e_n]
– c tells us the class of each sample
– From these we can compute μ_1(g), μ_2(g), σ_1(g), σ_2(g): the mean and standard deviation of g within each class
– P(g,c) = (μ_1(g) - μ_2(g)) / (σ_1(g) + σ_2(g))
– N_1(c,r): all genes g such that P(g,c) = r
– N_2(c,r): all genes g such that P(g,c) = -r
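A short sketch of this class-separation metric, assuming the expression matrix has genes as rows and samples as columns and has already been log-transformed; the selection size in the usage comment is illustrative only.

```python
import numpy as np

def class_separation(expr, in_class1):
    """P(g,c) = (mu_1(g) - mu_2(g)) / (sigma_1(g) + sigma_2(g)) for every gene.
    expr: genes x samples (log-transformed); in_class1: boolean mask over samples."""
    x1, x2 = expr[:, in_class1], expr[:, ~in_class1]
    mu1, mu2 = x1.mean(axis=1), x2.mean(axis=1)
    s1, s2 = x1.std(axis=1), x2.std(axis=1)
    return (mu1 - mu2) / (s1 + s2)

# Usage sketch: take the genes most correlated with each class (25 per class is illustrative).
# p = class_separation(np.log(expr), in_class1)
# top_class1 = np.argsort(p)[-25:]
# top_class2 = np.argsort(p)[:25]
```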

~1100 genes are informative -- number of genes within neighborhoods

Weighted voting for features

Weighted voting
– v_i = x_i - (μ_AML + μ_ALL) / 2
– w_i = P(g,c)
– Total votes: Class 1 is the sum of all positive w_i v_i; Class 2 is the sum of all negative w_i v_i

Prediction strength
– PS = (V_win - V_lose) / (V_win + V_lose)
– V_win and V_lose are the vote totals for the winning and losing classes, respectively
– PS gives a "margin of victory"
– A sample is assigned to the winning class only if PS > 0.3
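Combining the two slides above, a sketch of how one new sample might be scored. Only the PS > 0.3 threshold comes from the slides; the function and variable names are placeholders.

```python
import numpy as np

def vote(sample, gene_idx, w, mu_aml, mu_all, threshold=0.3):
    """Weighted voting with a prediction-strength cutoff for a single sample.
    w = P(g,c) for each selected gene; mu_aml / mu_all are the per-gene class means."""
    x = sample[gene_idx]                           # expression of the informative genes
    v = x - (mu_aml + mu_all) / 2.0                # v_i
    votes = w * v                                  # w_i * v_i
    v_class1 = votes[votes > 0].sum()              # votes for class 1
    v_class2 = -votes[votes < 0].sum()             # votes for class 2 (as a magnitude)
    v_win, v_lose = max(v_class1, v_class2), min(v_class1, v_class2)
    ps = (v_win - v_lose) / (v_win + v_lose)       # prediction strength
    if ps <= threshold:
        return "uncertain", ps
    return ("class1" if v_class1 > v_class2 else "class2"), ps
```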

Performance of the 50-gene predictor: 100% accuracy

Genes most correlated with AML/ALL class distinction

Feature sets All predictors that used between 10 and 200 genes were 100% accurate

Using SOM to discover classes

Bayesian perspective
– Assuming the class distributions are normal with equal variances
– The weight for a gene is (μ_1 - μ_2) / σ^2

Conclusion
– AML and ALL can be classified with as few as 10 genes
– "Many other gene selection metrics could be used; we considered several …. The best performance was obtained with the relative class separation metric defined above"

Discovering new types of cancer
Alizadeh, A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, February 3, 2000.

Goal
– Discover the cause of the different disease courses of diffuse large B-cell lymphoma (DLBCL): 40% of patients respond to therapy, 60% succumb to the disease
– Provide a diagnostic / prognostic tool
– DLBCL is the most common subtype of non-Hodgkin's lymphoma

Questions
– Can we create a molecular portrait of distinct types of B-cell malignancy?
– Can we identify types of malignancy not yet recognized?
– Can we relate malignancy to normal stages in B-cell development and physiology?

Lymphochip: 17,856 cDNA clones
– 12,069 from a germinal center B-cell library
– 2,338 from DLBCL, follicular lymphoma (FL), mantle cell lymphoma, and chronic lymphocytic leukaemia (CLL)
– 3,186 genes important to lymphocyte and/or cancer biology
– B- and T-lymphocyte genes that respond to mitogens or cytokines

Data sources
– Rearranged immunoglobulin genes in DLBCL are characteristic of the germinal centers of secondary lymphoid organs
– 96 normal and malignant lymphocyte samples

Lymphochip cluster

DLBCL subtypes visible

Feature discovery
– A: cluster germinal center B-cell genes and samples
– B: cluster more genes, using the sample clustering from A
– C: expanded view of B

DLBCL vs. normal B-lymphocyte differentiation

Distinct DLBCL groups by gene expression profiling
– A: grouped by gene expression
– B: grouped by IPI (International Prognostic Index)
– C: patients with IPI 0-2, grouped by gene expression

Summary
– The DLBCL groups are still diverse: some members of the GC B-like DLBCL group die (5 in the first 2 years)
– It may be possible to find informative features for more groups
– If constitutively expressed genes can be found in cancers, their upstream regulators can be targeted