Wenting Zhou, Weichen Wu, Nathan Palmer, Emily Mower, Noah Daniels, Lenore Cowen, Anselm Blumer Tufts University Microarray Data Analysis of Adenocarcinoma Patients’ Survival Using ADC and K-Medians Clustering

Overview Goals Introduction Explanation of ADC and NSM Explanation of MVR, K-Medians, and Hierarchical Clustering Results Conclusions

Goals Start with a classification of patients into high-risk and low-risk clusters Obtain a small subset of genes that still leads to good clusters These genes may be biologically significant One can use statistical or machine learning techniques on the reduced set that would have led to overfitting on the original set

Introduction to microarrays Photolithographic techniques used in integrated circuit fabrication can be adapted to construct arrays of thousands of genes Diagram courtesy of affymetrix.com

Monitoring gene expression values with microarrays mRNA from a tissue sample is transcribed to cDNA cDNA is labelled with markers and fragmented Labelled cDNA hybridizes to DNA on the microarray The microarray is scanned with ultraviolet light, causing the markers to fluoresce Diagram courtesy of affymetrix.com

Adenocarcinoma data sets We applied clustering and dimension-reduction techniques to gene expression values and survival times of patients with lung adenocarcinomas Harvard Data (n=84) Michigan Data (n=86) 12,600 genes 7129 genes

Overview Goals Introduction Explanation of ADC and NSM Explanation of MVR, K-Medians, and Hierarchical Clustering Results Conclusions

ADC and NSM Overview We use Approximate Distance Clustering maps (Cowen, 1997) to project the data into one or two dimensions so that we can use very simple clustering techniques. Then we use Nearest Shrunken Mean (Tibshirani, 1999) to reduce the number of genes used to predict the clusters. We evaluate using leave-one-out cross-validation and log-rank tests.

Approximate Distance Clustering (ADC, Cowen 1997) Approximate Distance Clustering is a method that reduces the dimensionality of the data. This is done by computing the distance from each datapoint to a subset of the data called a witness set. A different witness set is used for each desired dimension. A simple clustering technique is then applied to the projected data.

ADC map in one dimension

1-d ADC map with cutoff

General ADC Definition Choose witness sets D1, D2, …, Dq to be subsets of the data of sizes k1, k2, …, kq. The associated ADC map f(D1, D2, …, Dq) : R^p → R^q maps a datapoint x to (y1, y2, …, yq), where yi = min{ ||xj − x|| : xj ∈ Di } is the distance from x to the closest point in Di.
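In code, the ADC map is just a nearest-witness distance computation per coordinate. A minimal Python sketch of the definition above (toy data; Euclidean distance assumed):

```python
def adc_map(x, witness_sets):
    """Project point x to one coordinate per witness set: the distance
    from x to the nearest point in that set (Cowen, 1997)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return tuple(min(dist(x, w) for w in D) for D in witness_sets)

# Toy usage: two witness sets project a 3-d point into R^2.
D1 = [(0.0, 0.0, 0.0)]
D2 = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
y = adc_map((1.0, 1.0, 1.0), [D1, D2])
# y[0] is the distance to (0,0,0); y[1] is the distance to the
# nearer of the two points in D2.
```

A simple one-dimensional cutoff (as in the 1-d ADC slides) can then be applied directly to the projected values.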

Criterion for a good clustering Compute the Kaplan-Meier survival curves and the p-value from the log-rank test, then choose the clustering that minimizes W = 4000*a *b + 450*(1-c) + 50*d, where a = 1 if the size of the smaller group is < n/8 and 0 otherwise, b is the log-rank p-value, c is the difference between the final survival rates of the low-risk and high-risk groups, and d is the high-risk group's final survival rate.

Kaplan-Meier Curve Example
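The Kaplan-Meier curves used throughout come from the standard product-limit estimator. A pure-Python sketch with invented follow-up data (times and censoring flags are made up for illustration):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator.  times: follow-up times;
    events: 1 = death observed, 0 = censored.  Returns the survival
    curve as (time, S(t)) steps at each event time."""
    s = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n = sum(1 for ti in times if ti >= t)   # patients still at risk at t
        if d:
            s *= 1 - d / n
            curve.append((t, s))
    return curve

# Six patients: deaths at 2, 3, 5, 8; censored at 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8, 8], [1, 1, 0, 1, 0, 1])
```

Two such curves (high-risk vs. low-risk group) are then compared with the log-rank test to get the p-value b in the W criterion.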

Nearest Shrunken Mean (NSM) Gene Reduction (Tibshirani, 1999) NSM eliminates genes whose cluster means are close to the overall mean. NSM shrinks each cluster mean toward the overall mean by an amount proportional to the within-class standard deviation for that gene. If all of a gene's cluster means reach the overall mean, the gene can be eliminated.

Definition of NSM This gene would be retained [diagram: cluster 1 mean m1 and cluster 2 mean m2 shown against the overall mean for gene i, with at least one mean lying beyond the shrinkage amount ∆(s0 + si)]

Definition of NSM This gene would also be retained [diagram: same layout, with the cluster means at different distances from the overall mean for gene i]

Definition of NSM This gene would be eliminated [diagram: both cluster means lie within the shrinkage amount ∆(s0 + si) of the overall mean for gene i]
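The retain/eliminate rule illustrated in the diagrams above is a soft-thresholding step. A simplified sketch of that idea (function and variable names are ours, not from the slides, and the test values are invented):

```python
def shrink(cluster_mean, overall_mean, s_i, s_0, delta):
    """Soft-threshold one cluster mean toward the overall mean for one
    gene.  The mean moves by delta * (s_0 + s_i); if that would carry it
    past the overall mean, it is set equal to the overall mean."""
    d = cluster_mean - overall_mean
    step = delta * (s_0 + s_i)
    if abs(d) <= step:
        return overall_mean
    return overall_mean + (d - step if d > 0 else d + step)

def gene_is_retained(cluster_means, overall_mean, s_i, s_0, delta):
    """A gene is eliminated once every shrunken cluster mean equals the
    overall mean; otherwise it is retained."""
    return any(shrink(m, overall_mean, s_i, s_0, delta) != overall_mean
               for m in cluster_means)

# Cluster means far from the overall mean survive the shrinkage...
retained = gene_is_retained([5.0, 1.0], 3.0, s_i=0.5, s_0=0.5, delta=1.0)
# ...while means within delta * (s_0 + s_i) of it collapse onto it.
eliminated = not gene_is_retained([3.2, 2.9], 3.0, s_i=0.5, s_0=0.5, delta=1.0)
```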

Overview Goals Introduction Explanation of ADC and NSM Explanation of MVR, K-Medians, and Hierarchical Clustering Results Conclusions

MVR and K-Medians Overview We use naïve clustering by survival time instead of ADC for the initial clusters. We use variance ratios instead of NSM. We reduce the genes further using hierarchical clustering of expression profiles. We evaluate using K-medians clustering and log-rank tests.

Method: Minimum Variance Ratio (MVR) Gene Reduction The variance ratio is the sum of the within-cluster variances divided by the total variance of expression values for that gene. Genes with large variance ratios are thought to contribute less to the cluster definitions and are eliminated.
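The variance ratio described here can be computed directly per gene. A short Python sketch with made-up expression values and cluster labels:

```python
def variance_ratio(values, labels):
    """Sum of the within-cluster variances divided by the total variance
    of one gene's expression values.  Small ratios mean the clusters
    explain most of the gene's variation; genes with large ratios are
    candidates for elimination."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    within = sum(var([v for v, g in zip(values, labels) if g == k])
                 for k in set(labels))
    return within / var(values)

# A gene whose expression tracks the two clusters scores low...
r_good = variance_ratio([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], [0, 0, 0, 1, 1, 1])
# ...while a gene indifferent to the clusters scores high.
r_bad = variance_ratio([1.0, 5.0, 3.0, 1.0, 5.0, 3.0], [0, 0, 0, 1, 1, 1])
```

Ranking genes by this ratio and keeping the smallest gives the MVR gene reduction.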

Hierarchical Clustering of Genes Different genes may have similar expression profiles Eliminating similar genes may lead to a smaller set of genes that still leads to a good separation into high-risk and low-risk clusters Hierarchically cluster the genes until the desired number of clusters is obtained, then select one gene from each cluster
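The gene-grouping step above can be sketched as a small agglomerative clustering. Single linkage and Euclidean distance between expression profiles are our assumptions here (the slides don't specify the linkage or metric), and the toy profiles are invented:

```python
def hierarchical_reps(profiles, n_clusters):
    """Single-linkage agglomerative clustering of gene expression
    profiles, merging the closest pair of clusters until n_clusters
    remain, then returning one representative gene index per cluster."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return [c[0] for c in clusters]

# Genes 0/1 and 2/3 have near-identical profiles; asking for 2 clusters
# keeps one gene from each similar pair.
genes = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [9.0, 8.0, 7.0], [9.1, 8.1, 7.2]]
reps = hierarchical_reps(genes, 2)
```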

K-Medians Clustering This unsupervised clustering method finds the K datapoints that serve best as cluster centers. In this paper we use K=2, so it is possible to find the optimal clustering by trying all possible pairs of points as cluster centers. The quality of a clustering is measured as the total distance of the data points to their cluster centers.
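Because K = 2, the exhaustive search over center pairs described above is easy to write down. A Python sketch on toy 2-d points (real inputs would be the normalized 40-gene expression vectors):

```python
from itertools import combinations

def two_medians(points):
    """Exhaustive 2-medians: try every pair of data points as centers and
    keep the pair minimizing the total distance of all points to their
    nearest center."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = None
    for c1, c2 in combinations(points, 2):
        cost = sum(min(dist(p, c1), dist(p, c2)) for p in points)
        if best is None or cost < best[0]:
            best = (cost, c1, c2)
    _, c1, c2 = best
    labels = [0 if dist(p, c1) <= dist(p, c2) else 1 for p in points]
    return labels, (c1, c2)

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
labels, centers = two_medians(pts)
```

With n patients this tries n(n-1)/2 center pairs, which is entirely feasible for n around 85.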

Overview Goals Introduction Explanation of ADC and NSM Explanation of MVR, K-Medians, and Hierarchical Clustering Results Conclusions

Experimental Results The following tables give the results obtained when using the W criterion to select the best ADC witnesses and cutoffs, then reducing the set of probesets with NSM. The p-values were obtained from leave-one-out cross-validation on the reduced set of probesets. The values for STCC were obtained by following the same procedure but substituting clusters formed from the 50% or 60% highest-risk patients for the ADC clusters.

Comparison of 1-d and 2-d ADC with STCC on Michigan data (n = 86)

Kaplan-Meier Curve (p=.0009)

Comparison of 1-d and 2-d ADC with STCC on Harvard data (n = 84)

Kaplan-Meier Curve (p=.0332)

Results for unique genes Since multiple probesets can correspond to the same gene, we repeated the same procedure for the top 50 distinct genes. On the Michigan data, this gave p= On the Harvard data, this gave p=0.0331.

Kaplan-Meier curve for top 50 genes from Michigan data

Kaplan-Meier curve for top 50 genes from Harvard data

Validating ADC Between Michigan and Harvard Data We validated the 50 genes we obtained from the Michigan data by finding the genes in the Harvard data that matched by gene symbol and using those to run leave-one-out cross-validation on the Harvard data. For the 1-dimensional ADC, we found 48 matching genes in the Harvard data and obtained a p-value of 0.0254.

Kaplan-Meier Curve (p=.0254)

Validating ADC Between Michigan and Harvard Data We validated the 50 genes we obtained from the Harvard data by finding the genes in the Michigan data that matched by gene symbol and using those to run leave-one-out crossvalidation on the Michigan data. We validated the 50 genes we obtained from the Harvard data by finding the genes in the Michigan data that matched by gene symbol and using those to run leave-one-out crossvalidation on the Michigan data. For the 1-dimensional ADC, we found 42 matching genes in the Michigan data and obtained a p-value of For the 1-dimensional ADC, we found 42 matching genes in the Michigan data and obtained a p-value of

Kaplan-Meier Curve (p=.0307)

Some cancer-related genes found on our top-50 list, but not found in the Michigan top-50:
SPARCL1 (also known as MAST9 or hevin) - down-regulation of SPARCL1 also occurs in prostate and colon carcinomas, suggesting that SPARCL1 inactivation is a common event not only in NSCLCs but also in other tumors of epithelial origin
CD74 - well known for expression in cancers
PRDX1 - linked to tumor prevention
PFN2 - seen to increase in gastric cancer tissues
SFTPC - responsible for the morphology of the lung; a mutation causes chronic lung disease
HLA-DRA (HLA-A) - lack of expression causes cancers

MVR and K-Medians results We used Minimum Variance Ratio to select 200 genes from the Michigan and Harvard data, based on an initial clustering according to survival times. We then used hierarchical clustering to group these genes into 40 clusters. We selected one gene from each cluster and performed a K-medians clustering of the patients into a high-risk and a low-risk group using these 40 genes, after normalizing their expression profiles so that the clusters wouldn't be unduly influenced by genes with high mean expression values.
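The normalization step mentioned here is typically a per-gene z-score; a sketch under that assumption (the slides don't state the exact normalization used):

```python
def normalize(profile):
    """Z-score one gene's expression profile (subtract the mean, divide
    by the standard deviation) so that genes with large mean expression
    don't dominate the K-medians distances."""
    m = sum(profile) / len(profile)
    sd = (sum((x - m) ** 2 for x in profile) / len(profile)) ** 0.5
    return [(x - m) / sd for x in profile]

z = normalize([10.0, 12.0, 14.0])   # centered at 0, unit spread
```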

MVR and K-Medians results On the Michigan data this gave a p-value of with cluster sizes of 36 and 50, while on the Harvard data the p-value was with cluster sizes of 47 and 37. We used leave-one-out crossvalidation to verify this whole procedure. After clustering, the remaining patient was classified as high-risk or low-risk according to which cluster had the smaller average distance to that patient. For the Michigan data, this gave a p-value of and for the Harvard data the p-value was

Kaplan-Meier Curve (p=.00002)

Kaplan-Meier Curve (p=.0417)

Conclusion Combinations of simple techniques yield small sets of genes with high predictive power Different techniques give different sets of genes ADC - NSM was often superior to MVR - K-medians - Hierarchical Clustering, but the latter was surprisingly good

Conclusion The good news: much more research remains to be done. Thank you