Introduction to Functional Analysis J.L. Mosquera and Alex Sanchez.


2 Motivation The rise of the genomic era, and especially the deciphering of the whole genome sequences of several organisms, has produced huge quantities of information. New technologies such as DNA microarrays (but not only these!) allow the simultaneous study of hundreds, even thousands, of genes in a single experiment.

3 Motivation
This poses several challenges:
1) The experiment itself
2) Statistical analysis of results
3) Biological interpretation
Very often the results are large lists of genes which have been selected according to some specific criteria.
PROBLEM: How can a researcher give these sets a biological interpretation?

4 Rationale A reasonable thing to do is to rely on existing annotations which help to relate the selected sequences with biological knowledge. Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language. The annotation in this form is human readable and understandable, but difficult to interpret computationally.

5 What’s in a name?
QUESTION: What’s a cell?
o The same name can be used to describe different concepts.
o A concept can be described using different names.
o Comparison is difficult, especially across species or across databases.

6 Functional annotation
Probably the most important thing you want to know is what the genes or their products are concerned with, i.e. their function. Function annotation is difficult:
1) Different people use different words for the same function.
2) Different people may mean different things by the same word.
3) The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function.
Inference of function from sequence alone is error-prone and sometimes unreliable. The best function annotation systems use human beings who read the literature before assigning a function to a gene.

7 What can we do? To overcome some of the problems, an annotation system has been created: The Gene Ontology (GO).

8 What is an ontology?
An ontology is an entity which provides a set of vocabulary terms covering a conceptual domain. These terms must
1) have an exhaustive and rigorous definition,
2) be placed within a structure of relationships.
It is usually a hierarchical data structure. The terms may be linked by two kinds of relationships:
1) “is-a”, between parent and child;
2) “part-of”, between part and whole.
A term may have one or more parents.

9 What’s the GO?
The GO is a cooperative project, developed and maintained by the Gene Ontology Consortium. It is an annotation database created to provide a controlled vocabulary to describe gene and gene product attributes in any organism. It is organized around three basic ontologies:
Ontology              Number of terms (May 2005)
Molecular Function    7220
Biological Process    9529
Cellular Component    1536
Total terms           18285 (sum of the three rows above)

10 The GO ontologies and the GO graph
[Figure: the GO graph, with three branches under GO: Molecular Functions (MF), Biological Processes (BP), Cellular Components (CC)]

11 Genes and GO terms A given gene product may represent one or more molecular functions, be used in one or more biological processes and appear in one or more cellular components.

12 GO database
It consists of two essential parts:
1) The current ontologies:
  o Vocabulary
  o Structure
2) The current annotations:
  o Create a link between the known genes and the associated GO terms that define their function.
The GO database exists independently from other annotation databases:
1) It does not depend on the organism.
2) It does not depend on other databases, but most important databases have cross-references to the GO database.
  o It is possible to link and relate other annotations with those contained in GO.

13 Two types of GO annotations
o Electronic annotation
o Manual annotation
All annotations must
1) be attributed to a source,
2) indicate what evidence was found to support the GO term-gene/protein association.

14 Evidence Codes
IEA   Inferred from Electronic Annotation
ISS   Inferred from Sequence Similarity
IEP   Inferred from Expression Pattern
IMP   Inferred from Mutant Phenotype
IGI   Inferred from Genetic Interaction
IPI   Inferred from Physical Interaction
IDA   Inferred from Direct Assay
RCA   Inferred from Reviewed Computational Analysis
TAS   Traceable Author Statement
NAS   Non-traceable Author Statement
IC    Inferred by Curator
ND    No biological Data available
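As an illustration of the rule that every annotation carries a source and an evidence code, here is a minimal sketch of how such records might be stored and filtered. The `Annotation` class, the gene names, term identifiers and source labels are hypothetical placeholders, not GO data. Treating every code other than IEA as a manual annotation is an assumption made for this sketch.

```python
from dataclasses import dataclass

# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}


@dataclass(frozen=True)
class Annotation:
    gene: str      # hypothetical gene identifier
    term: str      # hypothetical GO term identifier
    evidence: str  # one of the keys of EVIDENCE_CODES
    source: str    # attribution of the annotation (placeholder)


def manual_only(annotations):
    """Keep annotations whose evidence code is not IEA (assumed to mean manual)."""
    return [a for a in annotations if a.evidence != "IEA"]


# Made-up records, for illustration only.
annotations = [
    Annotation("geneA", "term1", "IDA", "source1"),
    Annotation("geneA", "term2", "IEA", "source2"),
    Annotation("geneB", "term1", "TAS", "source3"),
]
curated = manual_only(annotations)
```

Filtering out electronic annotations before an enrichment analysis is one possible design choice; whether it is appropriate depends on how much coverage one is willing to trade for reliability.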

15 Enrichment Analysis
An unbiased method to ask the question, “What’s so special about my set of genes?” Many tools follow similar steps:
1) Obtain the GO annotation (most specific term(s)) for the genes in your set.
2) Climb the ontology to get all “parents” (more general terms).
3) Look at the occurrence of each term in your set compared to its occurrence in the population (all genes, or all genes on your chip).
4) Are some terms over-represented?
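Steps 1 to 3 above can be sketched as follows. This is a toy example under stated assumptions: the term names `T0` to `T4`, the gene names and the `PARENTS` and `DIRECT` dictionaries are invented for illustration and do not come from GO.

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to the set of its parent terms.
# T3 has two parents, which is allowed in a GO-like graph.
PARENTS = {
    "T0": set(),
    "T1": {"T0"},
    "T2": {"T0"},
    "T3": {"T1", "T2"},
    "T4": {"T1"},
}

# Hypothetical most-specific annotations for each gene (step 1).
DIRECT = {
    "g1": {"T3"},
    "g2": {"T4"},
    "g3": {"T2"},
    "g4": {"T4"},
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links (step 2)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def term_counts(genes, direct, parents):
    """For each term, count the genes annotated to it directly or via a more specific term."""
    counts = Counter()
    for gene in genes:
        terms = set()
        for term in direct.get(gene, ()):
            terms |= ancestors(term, parents)
        counts.update(terms)  # each gene contributes at most once per term
    return counts


population = ["g1", "g2", "g3", "g4"]  # e.g. all genes on the chip
selected = ["g1", "g2"]                # e.g. the genes of interest

population_counts = term_counts(population, DIRECT, PARENTS)
selected_counts = term_counts(selected, DIRECT, PARENTS)

# Step 3: compare the occurrence of each term in the set and in the population.
for term in sorted(PARENTS):
    print(term, selected_counts[term], "of", len(selected), "vs",
          population_counts[term], "of", len(population))
```

Step 4, deciding whether a difference is larger than expected by chance, requires a statistical test on these counts.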

16 Statistical methods for enrichment analysis
Let us consider:
o N genes on a microarray:
  - M belong to a given GO term category (A)
  - N-M do not belong to it (category A^c)
o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
o x of these K genes are in A
STATISTICAL HYPOTHESES:
H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes.
H1: GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray.

17 Hypergeometric Distribution (1/2)
We ask: assuming sampling without replacement, what is the probability of having exactly x genes of category A? The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K):
$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$
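A minimal sketch of this probability, using exact rational arithmetic so that it can be checked without rounding error. The function name `hypergeom_pmf` and the small parameter values are chosen here for illustration only.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in a category of size M out of N."""
    lowest = max(0, K - (N - M))   # fewest category genes a draw of K can contain
    highest = min(K, M)            # most category genes a draw of K can contain
    if x < lowest or x > highest:
        return Fraction(0)
    return Fraction(comb(M, x) * comb(N - M, K - x), comb(N, K))


# Illustrative values, not taken from the slides.
N, M, K = 20, 6, 5
pmf = [hypergeom_pmf(x, N, M, K) for x in range(K + 1)]

total = sum(pmf)                                  # should be exactly 1
mean = sum(x * p for x, p in enumerate(pmf))      # should equal K * M / N
```

The mean `K * M / N` is the number of category genes expected in the selected list under the null hypothesis, which is a useful reference point when reading an observed count x.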

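18 Hypergeometric Distribution (2/2)
So, under the null hypothesis, the p_value of having x or more genes in A will be:
$$p\_value = P(X \ge x) = \sum_{i=x}^{\min(K,M)} \frac{\binom{M}{i}\binom{N-M}{K-i}}{\binom{N}{K}}$$
This corresponds to a one-sided test in which small p_values relate to over-represented GO terms. For under-represented categories the relevant probability is the lower tail P(X ≤ x), which equals 1 - P(X ≥ x+1); computing it as 1 - p_value leaves out the term P(X = x) and is therefore only approximate.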
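The two tails can be sketched as below, with a small check that the lower tail equals one minus the upper tail starting at x+1. The function names are chosen for this sketch and the numbers in the usage lines are illustrative.

```python
from math import comb


def upper_tail(x, N, M, K):
    """P(X >= x): p-value for over-representation."""
    first = max(x, 0, K - (N - M))
    last = min(K, M)
    favourable = sum(comb(M, i) * comb(N - M, K - i) for i in range(first, last + 1))
    return favourable / comb(N, K)


def lower_tail(x, N, M, K):
    """P(X <= x): p-value for under-representation."""
    first = max(0, K - (N - M))
    last = min(x, K, M)
    favourable = sum(comb(M, i) * comb(N - M, K - i) for i in range(first, last + 1))
    return favourable / comb(N, K)


# Illustrative values: 8 genes, 4 in the category, 4 selected, 3 of them in the category.
N, M, K, x = 8, 4, 4, 3
p_over = upper_tail(x, N, M, K)
p_under = lower_tail(x, N, M, K)
```

Summing integer counts first and dividing once at the end keeps the arithmetic exact until the final step, which helps when the individual probabilities are very small.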

19 Disadvantages
The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large. It can be shown that [formula not preserved in the transcript]. Using this approximation, the p_value for over-represented GO terms can be calculated as [formula not preserved in the transcript].
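The approximation itself is missing from the transcript, so the following is only a sketch under an assumption: it uses the binomial distribution with success probability p = M/N, a standard approximation to the hypergeometric distribution when N is large relative to K. Whether this is the approximation the authors had in mind is not confirmed by the source. The parameter values are illustrative.

```python
from math import comb


def hypergeom_upper_tail(x, N, M, K):
    """Exact P(X >= x) under sampling without replacement."""
    first = max(x, 0, K - (N - M))
    last = min(K, M)
    favourable = sum(comb(M, i) * comb(N - M, K - i) for i in range(first, last + 1))
    return favourable / comb(N, K)


def binomial_upper_tail(x, K, p):
    """P(X >= x) for K independent draws with success probability p (assumed approximation)."""
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(max(x, 0), K + 1))


# Illustrative values with N much larger than K.
N, M, K, x = 10000, 500, 20, 3
exact = hypergeom_upper_tail(x, N, M, K)
approx = binomial_upper_tail(x, K, M / N)
print(f"exact = {exact:.6f}, binomial approximation = {approx:.6f}")
```

The binomial form ignores the fact that genes are drawn without replacement, so the two values drift apart as K becomes a sizeable fraction of N.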

20 Alternative approaches
Let us assume the following 2x2 table:
              D      D^c    Genes on microarray
Category A    n11    n12    N1.
A^c           n21    n22    N2.
              N.1    N.2    N..
where D denotes the differentially regulated genes, N = N.., M = N1., K = N.1 and x = n11.
Using this notation, alternatives include:
o test for equality of two proportions
o Fisher’s Exact Test
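The table above can be filled in from (N, M, K, x), and one common way to test equality of two proportions on it is Pearson's chi-square statistic with one degree of freedom. The slide does not say which form of the test was meant, so the choice of the chi-square statistic without continuity correction is an assumption of this sketch; the function names are also chosen here.

```python
from math import erfc, sqrt


def contingency_table(N, M, K, x):
    """Return (n11, n12, n21, n22) for the 2x2 table defined by N, M, K and x."""
    return x, M - x, K - x, N - M - K + x


def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic (no continuity correction) and its 1-df p-value."""
    total = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    statistic = total * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Illustrative values: N genes, M in the category, K selected, x selected and in the category.
table = contingency_table(N=9177, M=467, K=173, x=51)
statistic, p_value = chi_square_2x2(*table)
```

The chi-square test relies on a large-sample approximation, so it is usually considered less reliable than an exact test when some cell counts are small.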

21 Fisher’s Exact Test
This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table. One can enumerate all possible combinations of n11, n12, n21, n22 compatible with these totals. The p_value for a particular occurrence is the sum of the probabilities of all tables whose probability is lower than or equal to that of the observed combination.
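The procedure described above can be sketched directly: fix the margins, enumerate every admissible value of n11, and add up the probabilities of the tables that are no more likely than the observed one. The function name and the small example table are chosen for illustration. Integer weights are compared instead of floating-point probabilities so that ties are handled exactly.

```python
from math import comb


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher p-value: sum of table probabilities <= the observed one."""
    row1 = n11 + n12
    col1 = n11 + n21
    total = n11 + n12 + n21 + n22

    # With fixed margins, the whole table is determined by n11.
    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)

    def weight(k):
        # Numerator of the hypergeometric probability of the table with n11 = k.
        return comb(row1, k) * comb(total - row1, col1 - k)

    observed = weight(n11)
    extreme = sum(weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed)
    return extreme / comb(total, col1)


# Small illustrative table: n11 = 3, n12 = 1, n21 = 1, n22 = 3.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

For this table the admissible values of n11 are 0 to 4 with weights 1, 16, 36, 16 and 1 out of 70, so the tables as likely or less likely than the observed one account for 34/70 of the probability.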

22 Correction for multiple tests
As the number of GO terms tested for significance is large, the p_values have to be corrected for multiple testing. For instance:
o Methods controlling the False Discovery Rate (FDR):
  - Benjamini and Hochberg (assuming independence)
  - Benjamini and Yekutieli (dropping independence)
o Methods controlling the Family-Wise Error Rate (FWER):
  - Holm correction
  - Westfall and Young
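Two of the corrections listed above, Holm and Benjamini-Hochberg, can be sketched as adjusted p-values. This is a minimal illustration with made-up p-values; the Benjamini-Yekutieli and Westfall-Young procedures are not covered here, and the function names are chosen for this sketch.

```python
def holm(pvalues):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * pvalues[index])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[index] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        index = order[rank]
        # The p-value ranked r (1-based) is multiplied by m / r.
        value = pvalues[index] * m / (rank + 1)
        running_min = min(running_min, value)  # enforce monotonicity
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
holm_adjusted = holm(raw)
bh_adjusted = benjamini_hochberg(raw)
```

A term is then called significant when its adjusted p-value falls below the chosen threshold; FDR-controlling adjustments are typically less conservative than FWER-controlling ones, as the smaller values in `bh_adjusted` illustrate.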

23 Example
N = 9177 genes on microarray
M = 467 in GO category A
N-M = 8710 in A^c
K = 173 genes picked randomly
x = 51 genes of category A
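With the numbers of this example, the expected count under the null hypothesis and the hypergeometric p-value for over-representation can be computed as in the sketch below. Only N, M, K and x come from the slide; the variable names and the fold-enrichment summary are additions made here for illustration.

```python
from math import comb

# Values from the example slide.
N, M, K, x = 9177, 467, 173, 51

# Number of category-A genes expected among K genes drawn at random.
expected = K * M / N

# Ratio of observed to expected count (an illustrative summary, not from the slides).
fold_enrichment = x / expected

# One-sided hypergeometric p-value P(X >= x), summed exactly with integers
# and converted to a float only at the end.
favourable = sum(comb(M, i) * comb(N - M, K - i) for i in range(x, min(K, M) + 1))
p_value = favourable / comb(N, K)

print(f"expected = {expected:.2f}, observed = {x}, fold = {fold_enrichment:.2f}")
print(f"p_value = {p_value:.3e}")
```

About 8.8 genes of category A would be expected by chance, against 51 observed, so the p-value is extremely small and the category would be reported as over-represented.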