Microarray-based Disease Prognosis using Gene Annotation Signatures Michael Kovshilovsky Swapna Annavarapu SoCalBSI 2005.

Internship site: BioDiscovery, Inc. Mentor: Dr. Bruce Hoff. Source of funding: BioDiscovery, Inc.

Motivation Microarray gene-expression profiling studies aim to predict disease outcomes (e.g., cancer outcome). The goal is to improve the treatment of patients based on knowledge of their gene-expression profile (molecular signature).

Lancet Paper “Prediction of cancer outcome with microarrays: a multiple random validation strategy” Findings of Stefan Michiels et al.: “Gene expression microarray-based predictors of clinical outcome have been overly optimistic and careful review shows that performance is poor and variable.” –Analyzed data from the 7 largest published studies that attempted to predict the prognosis of cancer patients based on DNA microarray analysis. –Used a random sampling approach.

Goal Reproduce the results of the Lancet paper. Compare classification based on the expression levels of microarray probes with classification based on GSEA scores of biological pathways. Validate our hypothesis: –By abstracting away from the gene-expression domain to that of biological properties, classification performance should stabilize and improve.

Phase I: Reproduce the Lancet Paper (Gene-Expression-Based Classification)

Methodology –Data loading –Data preprocessing –Data selection –Correlating with clinical outcome –Determining the molecular signature –Classification of data

Data Loading Read Affymetrix chip expression data. [Table: sample data]

Data Preprocessing Scaling: –Identify the present, absent, and marginal expression levels. –Scale the average of the fluorescence intensities of all genes to a constant target intensity of 2500. –Cap expression values above 45000 at 45000 and set those below 100 to 1. Filtration: –Eliminate the genes with low or no variance. Log transformation: –log2(values).
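
The preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function name `preprocess`, the per-chip mean scaling, and the `min_variance` threshold are hypothetical, since the slide does not say how "low variance" was defined.

```python
import math
from statistics import mean, pvariance

def preprocess(chips, target=2500.0, floor=100.0, cap=45000.0, min_variance=1e-9):
    """chips: list of per-sample lists of raw intensities (same gene order).

    Returns (kept_gene_indices, matrix), where matrix holds log2 values
    for the kept genes only. All parameter names are illustrative.
    """
    # Scale each chip so that its average intensity equals the target.
    scaled = [[v * target / mean(chip) for v in chip] for chip in chips]
    # Cap high values; set values below the floor to 1, as stated on the slide.
    clipped = [[cap if v > cap else (1.0 if v < floor else v) for v in chip]
               for chip in scaled]
    # Drop genes whose variance across samples is (near) zero.
    # The threshold is an assumption; the slide gives no cut-off.
    n_genes = len(clipped[0])
    keep = [g for g in range(n_genes)
            if pvariance([chip[g] for chip in clipped]) > min_variance]
    # Log2-transform the remaining values.
    matrix = [[math.log2(chip[g]) for g in keep] for chip in clipped]
    return keep, matrix
```

With two chips `[2500, 2500, 2500]` and `[50, 2500, 4950]`, the middle gene is constant and is dropped, and the value 50 is set to 1 before the log transform.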

Preprocessed Data [Figure: data before and after preprocessing]

Data Selection Training-validation approach: –Training set for identifying the molecular signature. –Validation set for estimating the proportion of misclassifications. The dataset (N patients) is split by random selection into a training set (n patients) and a validation set (N − n patients), such that each set includes half the patients with and half without a favorable outcome.
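
A random, outcome-balanced split like the one described on this slide might look as follows. This is a minimal sketch: `split_training_validation` is a hypothetical name, and it assumes the balance is enforced on the training set, with the validation set receiving all remaining patients (so it is balanced only if the full dataset is).

```python
import random

def split_training_validation(outcomes, n_train, rng):
    """outcomes: list of +1 (favorable) / -1 (unfavorable), one per patient.

    Returns (train_idx, valid_idx). The training set gets n_train // 2
    patients of each outcome, chosen at random; the rest go to validation.
    """
    favorable = [i for i, y in enumerate(outcomes) if y == 1]
    unfavorable = [i for i, y in enumerate(outcomes) if y == -1]
    half = n_train // 2
    train = set(rng.sample(favorable, half) + rng.sample(unfavorable, half))
    valid = [i for i in range(len(outcomes)) if i not in train]
    return sorted(train), valid

# Repeating the split with a seeded generator gives reproducible resamples.
rng = random.Random(2005)
splits = [split_training_validation([1] * 6 + [-1] * 6, 6, rng) for _ in range(500)]
```

Calling the function 500 times, as in the last line, mirrors the 500 training sets mentioned later in the slides.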

Correlation Clinical outcome: –Favorable = 1 (continuous complete remission) –Unfavorable = −1 (relapse). Correlate the expression values of each gene with the clinical outcome using Pearson’s correlation coefficient. The molecular signature is defined by the 50 genes most highly correlated with the outcome.
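
The signature selection above can be illustrated with a small pure-Python sketch. The names `pearson` and `molecular_signature` are hypothetical, and ranking by the absolute value of the correlation is an assumption; the slide only says "top 50 highest correlated genes".

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def molecular_signature(expr, outcomes, size=50):
    """expr: samples x genes matrix; outcomes: +1 / -1 per sample.

    Returns the indices of the `size` genes with the largest |r|
    (use of the absolute value is an assumption).
    """
    n_genes = len(expr[0])
    r = [pearson([row[g] for row in expr], outcomes) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: abs(r[g]), reverse=True)
    return ranked[:size]
```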

Data Classification (Nearest-Centroid Prediction Rule) A new point is classified according to which centroid is nearest. The data are 50-dimensional, so a PCA plot is used to display them. Principal component analysis (PCA) is a tool for analyzing data by identifying patterns in it. [Figure: PCA plot showing the unfavorable centroid and the favorable centroid]
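
A nearest-centroid rule of this kind can be sketched as below. Euclidean distance is an assumption made for the sketch (the slide does not name the distance measure), and the function names are hypothetical.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def nearest_centroid_classifier(train_x, train_y):
    """Build a classifier from training vectors and +1 / -1 labels."""
    fav = centroid([x for x, y in zip(train_x, train_y) if y == 1])
    unf = centroid([x for x, y in zip(train_x, train_y) if y == -1])

    def classify(x):
        # Assign the label of the closer centroid (Euclidean distance assumed).
        return 1 if math.dist(x, fav) <= math.dist(x, unf) else -1

    return classify

def misclassification_rate(classify, xs, ys):
    """Proportion of samples whose predicted label differs from the true one."""
    return sum(classify(x) != y for x, y in zip(xs, ys)) / len(ys)
```

In the setting of the slides, each vector would hold the 50 signature values for one patient, and `misclassification_rate` would be evaluated on the validation set.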

Results (cont’d.) Each of the 500 training sets provided a different molecular signature. [Figure: plot of the genes that occurred most frequently in the molecular signature]

Analysis The frequency with which individual genes participate in defining the signature is quite low. This suggests that the molecular signature is selected almost at random and is unstable.
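
One way to quantify this instability is to count how often each gene appears across the resampled signatures. The helper below is a hypothetical sketch; `gene_frequencies` is not a name from the project.

```python
from collections import Counter

def gene_frequencies(signatures):
    """signatures: list of gene lists, one per training set.

    Returns a dict mapping each gene to the fraction of signatures that
    contain it, ordered from most to least frequent.
    """
    counts = Counter(g for sig in signatures for g in set(sig))
    total = len(signatures)
    return {g: c / total for g, c in counts.most_common()}
```

A stable signature would show many genes with a frequency close to 1; frequencies that stay low for nearly all genes correspond to the behavior described on this slide.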

Phase II: Analysis of Microarray Data using GSEA (Gene Set Enrichment Analysis) http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Methodology –Data loading –Data preprocessing –Data selection –GSEA: determine enrichment scores –Correlating with clinical outcome –Classification of data

Preliminary steps –Data loading –Data preprocessing (same as in Phase I) –Data selection

GSEA Gene Set Enrichment Analysis –A microarray data analysis method that uses predefined gene sets and gene ranks to identify significant biological changes in microarray data sets. –GSEA provides an enrichment score that measures the degree to which a gene set is enriched within a rank-ordered gene list derived from the data set.

GSEA (cont’d) GSEA inputs: –A list of genes ranked according to the expression difference between two classes. –A priori defined gene sets (e.g., pathways), each consisting of members drawn from the list of genes. Genes are ranked using a distance metric, the signal-to-noise ratio (SNR). http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf

Signal-to-Noise Ratio The signal-to-noise ratio method looks at the difference of the means of the two classes scaled by the sum of the standard deviations: SNR = (α × sqrt(n)) / σ, where α (signal) is the difference in mean expression between the two classes and σ (noise) is the standard deviation.
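
The ranking step can be sketched as follows. Note the assumption: this sketch follows the verbal description on the slide (difference of class means divided by the sum of the class standard deviations), which is a commonly used form of the SNR, rather than the written expression with sqrt(n), whose n is not defined on the slide. Function names are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(favorable, unfavorable):
    """(mean_fav - mean_unf) / (sd_fav + sd_unf) for one gene.

    Assumed form of the SNR; needs at least two values per class.
    """
    noise = stdev(favorable) + stdev(unfavorable)
    if noise == 0:
        return 0.0  # avoid division by zero for constant genes
    return (mean(favorable) - mean(unfavorable)) / noise

def rank_genes_by_snr(expr, outcomes):
    """expr: samples x genes matrix; outcomes: +1 / -1 per sample.

    Returns gene indices sorted from highest to lowest SNR.
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        fav = [row[g] for row, y in zip(expr, outcomes) if y == 1]
        unf = [row[g] for row, y in zip(expr, outcomes) if y == -1]
        scores.append(signal_to_noise(fav, unf))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
```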

Implementation Determine the SNR for each microarray. Sort the gene list by SNR value. The degree of enrichment of a gene set is measured by comparing the SNR-ordered gene list with the gene set (pathway). http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Walk down the ranked gene list while keeping a running sum: –If the gene is in the gene set, increment the running sum by Y. –If the gene is not in the gene set, decrement the running sum by X. X = sqrt(G / (N − G)), Y = sqrt((N − G) / G), where G = number of genes in the set and N = size of the data. Enrichment Score (ES): ES = greatest positive deviation of this running sum across all genes. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
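
The running-sum rule on this slide translates directly into code. The sketch below assumes that N is the number of genes in the ranked list and that only set members present in that list are counted toward G; `enrichment_score` is a hypothetical name.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over the ranked list.

    ranked_genes: genes ordered by SNR (best first).
    gene_set: collection of genes forming one pathway.
    """
    n = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    g = len(members)
    if g == 0 or g == n:
        return 0.0  # X and Y are undefined in these degenerate cases
    up = math.sqrt((n - g) / g)      # Y: step for a gene in the set
    down = math.sqrt(g / (n - g))    # X: step for a gene not in the set
    running = best = 0.0
    for gene in ranked_genes:
        running += up if gene in members else -down
        best = max(best, running)
    return best
```

With these step sizes the running sum returns to zero at the end of the list, since G × Y = (N − G) × X. A gene set concentrated at the top of the ranking therefore produces a large early peak, while a set at the bottom yields an ES of 0.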

Correlation & Classification Similar to Phase I: –First, the top 50 pathways are selected to create the favorable and unfavorable centroids. –Next, the training and validation sets are classified using the nearest-centroid prediction rule.

Results (cont’d.) Each of the 500 training sets provided a different molecular signature. [Figure: plot of pathways that occurred in over 150 of the molecular signatures]

Results Gene Expression: Average % = 97.88%. Gene Set Based: Average % = 93.77%.

Results (cont’d) Gene Expression: Average % = 93.80%. Gene Set Based: Average % = 96.45%.

Results (cont’d) Gene Expression: Average % = 52.91%. Gene Set Based: Average % = 75.17%.

Results (cont’d) Gene Expression: Average % = 26.48%. Gene Set Based: Average % = 47.76%.

Three significant pathways Iron ion homeostasis: –Reduces tumor angiogenesis by protecting cells from oxidative stress. Unfolded protein response, positive regulation of target gene transcription: –A stress-signaling pathway in tumor cells. Tryptophan catabolism: –Has an antiproliferative effect on many tumor cells.

Conclusion Our results have shown that: –Centroid classification based on gene expression performs poorly on the validation set. –The GSEA method does not perform any better than the gene-expression method.

Future Work Analysis with a different classification approach. Using much larger data sets from different samples.

Acknowledgements Dr. Bruce Hoff Dr. Soheil Shams SoCalBSI

References
1. Michiels, S., Koscielny, S., Hill, C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, Vol. 365, 488–92 (2005).
2. Mootha, V. K., et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, Vol. 34, 267–273 (2003).
3. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
4. http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf
5. http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
