Pathway Analysis Michael Sneddon Southern California Bioinformatics Institute August 20, 2004.

Presentation transcript:

Pathway Analysis Michael Sneddon Southern California Bioinformatics Institute August 20, 2004

Project Overview
BioDiscovery specializes in microarray data analysis software:
–Image Analysis
–Data Analysis
–Data Management
How can microarray data be used to find information about biological pathways?
Project: explore different ways to extract information about biological pathways from microarray data.
CONFIDENTIAL

Sample Microarray Data
Microarrays can provide information on differential expression between conditions.
The most differentially expressed genes are singled out for further study.
[Table: expression values for genes (rows) under the Healthy and Infected conditions (columns); the values were not preserved in the transcript.]
Gene 3 would be selected for further study.
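As a rough illustration of singling out the most differentially expressed gene, the sketch below ranks genes by the absolute value of Welch's t statistic between the two conditions. The slide does not say which statistic was used (a later slide mentions T-test P-values), so the choice of statistic, the function names, and all expression values here are assumptions made for illustration only.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (sample variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


def rank_genes(healthy, infected):
    """Return gene names sorted by |t|, most differentially expressed first."""
    t = {gene: welch_t(healthy[gene], infected[gene]) for gene in healthy}
    return sorted(t, key=lambda gene: abs(t[gene]), reverse=True)


# Hypothetical replicate measurements; not taken from the slides.
healthy = {
    "Gene 1": [5.0, 5.2, 4.9],
    "Gene 2": [7.1, 6.8, 7.0],
    "Gene 3": [2.0, 2.1, 1.9],
}
infected = {
    "Gene 1": [5.1, 5.0, 5.2],
    "Gene 2": [6.9, 7.2, 7.0],
    "Gene 3": [6.0, 6.2, 5.9],
}

ranking = rank_genes(healthy, infected)
print(ranking[0])  # Gene 3
```

With these made-up values, Gene 3 changes far more than its replicate noise, so it ranks first, mirroring the selection described on the slide.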

A Different Approach
Difficulties With Old Approach
–No gene is significantly differentially expressed.
–Many genes are significantly differentially expressed.
–Not making use of prior knowledge.
A Different Approach
–Look for affected biological processes, sometimes called pathways, instead of individual genes.
–Need a way to convert a list of differential gene expression values into scores for pathways.
–The way to do that is through a scoring metric.

Method of Ranking Pathways
[Diagram: Microarray Data and Annotations feed into a Scoring Metric, which outputs a score for each pathway.]
A score for each pathway indicates how much it was affected by the condition.
Many different scoring metrics are available.

A Simple Metric
Pathway: Photosynthesis
Gene names (the P-value column was not preserved in the transcript):
1) 200078_s_at
2) 201172_x_at
3) 205473_at
4) 208678_at
5) 214244_s_at
6) 230565_at
7) 36994_at
8) 39144_s_at
Score = (# of genes with a P-value below 0.2) / (total # of genes in pathway)
5 of the 8 genes in this pathway have a P-value below 0.2.
Score = 5/8 = 0.625
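The simple metric above can be sketched in a few lines. The function name is hypothetical, and the P-values are invented so that 5 of 8 fall below the 0.2 threshold, since the slide's actual P-values were lost.

```python
def fraction_below_threshold(p_values, threshold=0.2):
    """Score a pathway as the fraction of its genes with a P-value below threshold."""
    if not p_values:
        raise ValueError("pathway has no genes")
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)


# Hypothetical P-values for an 8-gene pathway; 5 of them are below 0.2.
example = [0.01, 0.05, 0.12, 0.15, 0.19, 0.35, 0.60, 0.90]
print(fraction_below_threshold(example))  # 0.625
```

The comparison is strict (`<`), matching the slide's wording "below 0.2"; a gene with a P-value of exactly 0.2 is not counted.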

Project Goals
I. Analyze and compare different scoring metrics
–How similar are the different metrics?
–Which metric produces the most biologically significant results?
–When should we use a particular metric over another?
II. Explore known ranking metrics
–How and why do they work?
–Is there a way to improve them or design a better one?

The Metrics Investigated
Enrichment: the original method used to rank pathways; it is still widely used today.
GSEA (Gene Set Enrichment Analysis): a recently published* method using a Kolmogorov-Smirnov statistic.
Shams 1, Shams 2, Shams 3: potential BioDiscovery scoring metrics.
* Mootha, et al., "PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes." (Nat Genet 2003 Jul;34(3):267-73).

Part I: Compare the metrics
Compare each metric to all the others to see if they produce similar results.
If they are very similar, it doesn't matter which one we use.
If they are different, which one is correct? Or are they both correct?

How to Compare Metrics
Wrote a program that does the following:
(1) Rank the same pathways using different metrics
(2) Take the top pathways from each ranking
(3) Count the number of pathways that are in common among the top pathways being considered
(4) Construct a % similarity score = # of pathways in common divided by the number of top pathways taken from each ranking
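The four steps above can be sketched as follows. This is a minimal illustration, not the program described on the slide: the function names and the scores are hypothetical, and it assumes that a higher score means a more affected pathway under every metric.

```python
def top_k(scores, k):
    """Return the names of the k highest-scoring pathways."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def percent_similarity(scores_a, scores_b, k):
    """Percentage of pathways shared by the top-k lists of two metrics."""
    if k <= 0:
        raise ValueError("k must be positive")
    common = set(top_k(scores_a, k)) & set(top_k(scores_b, k))
    return 100.0 * len(common) / k


# Hypothetical scores for the same four pathways under two metrics.
metric_a = {"p1": 2.0, "p2": 1.9, "p3": 1.5, "p4": 1.0}
metric_b = {"p1": 10.0, "p2": 300.0, "p3": 50.0, "p4": 200.0}

print(percent_similarity(metric_a, metric_b, 2))  # 50.0
```

Only the rank order matters, so metrics on very different scales can be compared directly.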

An Example
Compare the top 12 pathways from each metric. (GSEA scores shown as "?" were not preserved in the transcript.)

SHAMS I                                     GSEA
PATHWAY NAME                       SCORE    PATHWAY NAME                       SCORE
tachykinin signaling pathway       2.051    nucleotide-sugar metabolism        ?
endothelin receptor activity       1.983    endothelin-B receptor activity     270.5
initiation factor 4F complex       1.965    endothelin receptor activity       ?
nucleotide-sugar metabolism        1.807    odorant binding                    ?
sarcoglycan complex                1.829    interleukin-6 receptor binding     ?
ubiquitin C-terminal activity      1.664    histone deacetylation              ?
odorant binding                    1.539    clathrin binding                   ?
interleukin-6 receptor binding     1.507    female germ-cell nucleus           ?
cysteine-type peptidase activity   1.488    AMP deaminase activity             ?
clathrin binding                   1.432    cholecystokinin receptor activity  ?
female germ-cell nucleus           1.422    malate metabolism                  ?
AMP deaminase activity             1.415    malate activity                    ?
anticoagulant activity             1.337    malate dehydrogenase               ?
protease activator activity        1.336    delta-opioid receptor activity     112.8
malate metabolism                  1.326    vasculogenesis                     109.0

An Example (continued)
[Table repeated from the previous slide, apparently with the matching pathways marked. The overlay text adds two entries to the transcript: clathrin binding 159.2 and cholecystokinin receptor activity 1.385.]
6 matches out of 12 total pathways = 50% similarity

Repeat The Process
First, take the top 10 pathways.
Then take the top 20 pathways.
Then take the top 30 pathways.
Continue until a pattern is seen.
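Repeating the comparison at increasing cut-offs yields a similarity curve. The sketch below is one way to compute it; the function name and the scores are hypothetical, and it again assumes that a higher score means a more affected pathway.

```python
def similarity_curve(scores_a, scores_b, cutoffs):
    """Return (cutoff, % similarity) pairs for the top-k lists of two metrics."""
    rank_a = sorted(scores_a, key=scores_a.get, reverse=True)
    rank_b = sorted(scores_b, key=scores_b.get, reverse=True)
    curve = []
    for k in cutoffs:
        common = set(rank_a[:k]) & set(rank_b[:k])
        curve.append((k, 100.0 * len(common) / k))
    return curve


# Hypothetical scores for the same four pathways under two metrics.
metric_a = {"p1": 2.0, "p2": 1.9, "p3": 1.5, "p4": 1.0}
metric_b = {"p1": 10.0, "p2": 300.0, "p3": 50.0, "p4": 200.0}

for cutoff, similarity in similarity_curve(metric_a, metric_b, range(1, 5)):
    print(cutoff, round(similarity, 1))
```

With real data the cut-offs would be 10, 20, 30, and so on, as on the slide. The curve always reaches 100% once the cut-off equals the total number of pathways, so only the early part of the curve is informative.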

Example Graph of Results
[Graph: % similarity (y-axis) versus cut-off value, out of 2646 pathways (x-axis).]
Between Shams 1 and Shams 2, the top 20 pathways have about 36% similarity.

Results
No two metrics were very similar (i.e., 85% or more) in any dataset tested.
Percent similarities differed greatly between datasets: no two metrics demonstrated a consistent amount of similarity.
Since the metrics ranked the pathways differently, which metrics are correct? Or are they all correct?
Begin by verifying and understanding what has already been researched: GSEA.

Part II: Exploring a Metric: GSEA
Gene Set Enrichment Analysis
A result of the collaboration of many individuals from a number of institutions, including MIT and Harvard.
Devised in order to identify the pathways that are significantly affected in individuals with type 2 diabetes compared to healthy individuals.
How, exactly, does GSEA work?
Is our implementation correct?

How GSEA works
(1) Rank the genes based on differential expression.
[Table: ranked gene list with columns #, Gene, and P-Value (T-test); the entries were not preserved in the transcript.]
(2) Compute a score for each pathway based on where the genes of that pathway appear in the ranked list.
If pathway one contains one marked set of three genes, and pathway two contains another marked set of three genes, then pathway one is given a higher score than pathway two.
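A minimal sketch of a Kolmogorov-Smirnov-style running-sum score, following the general description in Mootha et al. as I understand it: walk down the ranked list, step up at each pathway gene and down at every other gene, and take the maximum of the running sum. The step sizes are chosen so the sum returns to zero at the end of the list. This is not BioDiscovery's implementation, which the slides do not show, and the gene names are hypothetical.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum over the ranked list (KS-style statistic)."""
    n = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    g = len(members)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = math.sqrt((n - g) / g)    # step up at a pathway gene
    miss = -math.sqrt(g / (n - g))  # step down at any other gene
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best


# Hypothetical ranked list, most differentially expressed gene first.
ranked = [f"g{i}" for i in range(1, 11)]
pathway_one = {"g1", "g2", "g3"}  # genes clustered at the top of the list
pathway_two = {"g2", "g6", "g9"}  # genes spread through the list

print(enrichment_score(ranked, pathway_one) > enrichment_score(ranked, pathway_two))  # True
```

Pathway one scores higher because its genes sit together at the top of the ranking, which is the behavior the slide describes.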

Importance of a P-Value
A metric will always produce a ranking.
Is the ranking we get significant or could it have been generated randomly?
Answer: We need to compute a P-value to make sure that the score we get is unlikely to have been produced by chance.

Constructing a P-value
(1) Permute class labels 1000 times
(2) Rank the pathways with each different permutation
(3) Create a histogram of top values based on the permutations
(4) Figure out where in the histogram the actual data lies; this shows how significant the score is.
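The permutation procedure above can be sketched as below. It reads "top values" as the highest pathway score in each permutation, which is an interpretation of the slide. The scoring function is a toy stand-in (mean absolute difference of class means), not GSEA; the data, the names, and the +1 correction in the P-value are all assumptions made for illustration.

```python
import random


def toy_scores(expression, pathways, labels):
    """Toy metric: mean absolute difference of class means over a pathway's genes."""
    scores = {}
    for name, genes in pathways.items():
        diffs = []
        for gene in genes:
            a = [v for v, lab in zip(expression[gene], labels) if lab == 0]
            b = [v for v, lab in zip(expression[gene], labels) if lab == 1]
            diffs.append(abs(sum(b) / len(b) - sum(a) / len(a)))
        scores[name] = sum(diffs) / len(diffs)
    return scores


def permutation_p_value(observed, score_fn, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose top pathway score reaches `observed`."""
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # step 1: permute class labels
        top = max(score_fn(shuffled).values())  # steps 2-3: top score per permutation
        if top >= observed:  # step 4: locate the actual score in the null distribution
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # +1 avoids reporting a P-value of exactly 0


# Hypothetical data: 4 genes, 8 samples (4 per class), 2 pathways.
expression = {
    "g1": [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0],
    "g2": [2.0, 2.1, 1.9, 2.0, 4.0, 4.2, 3.8, 4.0],
    "g3": [1.0, 2.0, 1.5, 1.2, 1.1, 1.9, 1.4, 1.3],
    "g4": [0.5, 0.7, 0.6, 0.4, 0.6, 0.5, 0.7, 0.5],
}
pathways = {"A": ["g1", "g2"], "B": ["g3", "g4"]}
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def score_fn(perm_labels):
    return toy_scores(expression, pathways, perm_labels)


actual = score_fn(labels)
p_a = permutation_p_value(actual["A"], score_fn, labels)
p_b = permutation_p_value(actual["B"], score_fn, labels)
print(p_a < p_b)  # True
```

Comparing each actual score against the distribution of top scores, rather than against that pathway's own permuted scores, makes the test conservative with respect to the many pathways being ranked at once.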

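Constructing a P-value
[Histogram: GSEA score (x-axis) versus number of permutations (y-axis), with two example positions marked for the actual score.]
If the actual score falls in the extreme tail of the histogram, the score is significant.
But if the actual score falls within the bulk of the histogram, the score is not significant.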

Implementation
BioDiscovery already had an implementation of the GSEA scoring metric.
What I did:
–Tweaked the code so that it works better and functions more like the original published method.
–Extended the code to compute a P-value to measure the significance of GSEA scores.

Results of GSEA analysis
A better understanding of how GSEA operates, especially in comparison to other potential metrics.
A good implementation of the GSEA metric.
An implementation of a permutation analysis to judge the significance of calculated scores.

Next steps
Extend the permutation analysis implemented for GSEA to all the metrics to verify the significance of the results.
Submit these significant results to biologists to see which metrics make the most sense.
Final step: integrate the best metrics and the permutation analysis into one application for biologists.

Acknowledgments
Special thanks to:
–Dr. Soheil Shams
–Dr. Bruce Hoff
–Keala Chan
–The staff of BioDiscovery, Inc.
–The professors of SoCalBSI
–The students of SoCalBSI
Funding provided by:
–National Science Foundation
–National Institutes of Health

Works Cited
Mootha, et al., "PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes." (Nat Genet 2003 Jul;34(3):267-73).
Damian D, Gorfine M., "Statistical concerns about the GSEA procedure." (Nat Genet 2004 Jul;36(7):663; author reply 663).
Confidential Documents of BioDiscovery, Inc.