1
Introduction to Functional Analysis J.L. Mosquera and Alex Sanchez
2
2 Motivation The rise of the genomic era, and especially the deciphering of the whole genome sequences of several organisms, has produced huge quantities of information. New technologies such as DNA microarrays (but not only these!) allow the simultaneous study of hundreds, even thousands, of genes in a single experiment.
3
3 Motivation This poses different challenges: 1) The experiment itself 2) Statistical analysis of results 3) Biological interpretation. Very often the results are large lists of genes which have been selected according to some specific criteria. PROBLEM: How can a researcher give these sets a biological interpretation?
4
4 Rationale A reasonable thing to do is to rely on existing annotations which help to relate the selected sequences with biological knowledge. Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language. The annotation in this form is human readable and understandable, but difficult to interpret computationally.
5
5 What’s in a name? QUESTION: What’s a cell? (Image from http://microscopy.fsu.edu) The same name can be used to describe different concepts. A concept can be described using different names. Comparison is difficult, especially across species or across databases.
6
6 Functional annotation Probably the most important thing you want to know is what the genes or their products are concerned with, i.e. their function. Function annotation is difficult: 1) Different people use different words for the same function, 2) or may mean different things by the same word. 3) The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function. Inference of function from sequence alone is error-prone and sometimes unreliable. The best function annotation systems use human beings who read the literature before assigning a function to a gene.
7
7 What can we do? To overcome some of the problems, an annotation system has been created: The Gene Ontology (GO).
8
8 What is an ontology? An ontology is an entity which provides a set of vocabulary terms covering a conceptual domain. These terms must 1) have an exhaustive and rigorous definition, 2) be placed within a structure of relationships. It is usually a hierarchical data structure. The terms may be linked by two kinds of relationships: 1) “is-a” between parent and child, 2) “part-of” between part and whole. A term may have one or more parents.
9
9 What’s the GO? The GO is a cooperative project, developed and maintained by the Gene Ontology Consortium. It is an annotation database created to provide a controlled vocabulary to describe gene and gene product attributes in any organism. It is organized around three basic ontologies:
Ontology              Number of terms (1)
Molecular Function    7220
Biological Process    9529
Cellular Component    1536
Total Terms           18285
(1) May, 2005
10
10 The GO ontologies and the GO graph [Figure: GO and its three ontologies: Molecular Functions (MF), Biological Processes (BP), Cellular Components (CC)]
11
11 Genes and GO terms A given gene product may represent one or more molecular functions, be used in one or more biological processes and appear in one or more cellular components.
12
12 GO database It consists of two essential parts: 1) The current ontologies: vocabulary and structure. 2) The current annotations: these create a link between the known genes and the associated GO terms that define their function. The GO database exists independently from other annotation databases: 1) It does not depend on the organism. 2) It does not depend on other databases, but most important databases have cross-references to the GO database, so it is possible to link and relate other annotations with those contained in GO.
13
13 Two types of GO Annotations: Electronic Annotation and Manual Annotation. All annotations must 1) be attributed to a source, 2) indicate what evidence was found to support the GO term-gene/protein association.
14
14 Evidence Codes
IEA   Inferred from Electronic Annotation
ISS   Inferred from Sequence Similarity
IEP   Inferred from Expression Pattern
IMP   Inferred from Mutant Phenotype
IGI   Inferred from Genetic Interaction
IPI   Inferred from Physical Interaction
IDA   Inferred from Direct Assay
RCA   Inferred from Reviewed Computational Analysis
TAS   Traceable Author Statement
NAS   Non-traceable Author Statement
IC    Inferred by Curator
ND    No biological Data available
15
15 Enrichment Analysis An unbiased method to ask the question, “What’s so special about my set of genes?” Many tools follow similar steps: 1) Obtain the GO annotation (most specific term(s)) for the genes in your set. 2) Climb the ontology to get all “parents” (more general terms). 3) Look at the occurrence of each term in your set compared to its occurrence in the population (all genes, or all genes on your chip). 4) Are some terms over-represented?
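Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term names, gene names, and the `PARENTS` and `ANNOTATIONS` dictionaries are hypothetical toy data, not real GO content, and real tools read the ontology and annotations from GO files.

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to its parent terms.
# A term may have more than one parent, as in GO.
PARENTS = {
    "term_c": ["term_a", "term_b"],
    "term_a": ["root"],
    "term_b": ["root"],
    "root": [],
}

# Hypothetical annotations: each gene maps to its most specific term(s).
ANNOTATIONS = {"g1": ["term_c"], "g2": ["term_a"], "g3": ["term_b"]}


def with_ancestors(terms, parents):
    """Return the given terms plus every more general term reachable from them."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def term_counts(genes, annotations, parents):
    """Count, for each term, how many of the given genes carry it (directly or via a child)."""
    counts = Counter()
    for gene in genes:
        counts.update(with_ancestors(annotations.get(gene, []), parents))
    return counts


selected = ["g1", "g2"]             # the gene set of interest
population = ["g1", "g2", "g3"]     # e.g. all genes on the chip

selected_counts = term_counts(selected, ANNOTATIONS, PARENTS)
population_counts = term_counts(population, ANNOTATIONS, PARENTS)

for term in sorted(population_counts):
    print(term, selected_counts[term], "of", len(selected),
          "vs", population_counts[term], "of", len(population))
```

Each gene is counted once per term, which is why the ancestors are collected in a set. Step 4 then compares the two counts for each term with a statistical test.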
16
16 Statistical Methods for enrichment analysis Let us consider:
o N genes on a microarray: M belong to a given GO term category (A), N-M do not belong to it (category A^c)
o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
o x genes of these K will be in A (EXAMPLE)
STATISTICAL HYPOTHESIS:
H_0: GO category A is equally represented on the microarray and in the class of differentially regulated genes
H_1: GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray
17
17 Hypergeometric Distribution (1/2) We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A? The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K).
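Written out with the symbols defined on the previous slide (N genes on the array, M of them in category A, K genes selected), the standard hypergeometric probability of observing exactly x genes of category A is:

```latex
P(X = x) = \frac{\binom{M}{x}\,\binom{N-M}{K-x}}{\binom{N}{K}},
\qquad \max\bigl(0,\, K-(N-M)\bigr) \le x \le \min(K, M)
```

The numerator counts the ways to pick x genes from category A and the remaining K-x genes from A^c; the denominator counts all ways to pick K genes from the N on the array.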
18
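18 Hypergeometric Distribution (2/2) So, under the null hypothesis, the p_value of having x or more genes in A will be p_value = P(X ≥ x), that is, the sum of P(X = i) for i from x to min(K, M). This corresponds to a one-sided test in which small p_values relate to over-represented GO terms. For under-represented categories, the p_value P(X ≤ x) can be calculated as 1 - p_value + P(X = x), so that the observed count is included in the lower tail.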
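A minimal sketch of this one-sided p_value in Python, using only the standard library. The function names are illustrative choices, not taken from any tool mentioned in the slides; exact fractions are used so that very small tail probabilities are not lost to rounding.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in category A."""
    if x < max(0, K - (N - M)) or x > min(K, M):
        return Fraction(0)  # outside the support of the distribution
    return Fraction(comb(M, x) * comb(N - M, K - x), comb(N, K))


def upper_tail_p_value(x, N, M, K):
    """P(X >= x): p_value for over-representation of category A."""
    return sum((hypergeom_pmf(i, N, M, K) for i in range(x, min(K, M) + 1)),
               Fraction(0))


def lower_tail_p_value(x, N, M, K):
    """P(X <= x): p_value for under-representation of category A."""
    return 1 - upper_tail_p_value(x, N, M, K) + hypergeom_pmf(x, N, M, K)


# Numbers from the Example slide of this presentation.
p_example = upper_tail_p_value(51, N=9177, M=467, K=173)
print(float(p_example))
```

With these numbers only about 9 genes of category A would be expected by chance, so the resulting p_value is extremely small.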
19
19 Disadvantages The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large. We can prove that a simpler approximation is valid for large N. Using this approximation, the p_value for over-represented GO terms can be calculated as
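The approximation formula itself is not preserved in this transcript. A common choice in this setting is the binomial approximation, which treats each of the K selected genes as an independent draw with success probability p = M/N. The sketch below assumes that this is the approximation meant; the function name is illustrative.

```python
from math import comb


def binomial_upper_tail(x, K, p):
    """P(X >= x) for X ~ Binomial(K, p).

    Assumed stand-in for the hypergeometric tail when N is large:
    p = M / N is the fraction of genes on the array in category A.
    """
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x, K + 1))


# Numbers from the Example slide: N = 9177, M = 467, K = 173, x = 51.
N, M, K, x = 9177, 467, 173, 51
p_approx = binomial_upper_tail(x, K, M / N)
print(p_approx)
```

The approximation ignores the fact that genes are sampled without replacement, so it is most reasonable when K is small relative to N.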
20
20 Alternative approaches Let us assume the following table:

             Differentially regulated genes (D)   D^c    Genes on microarray
Category A   n11                                  n12    N1.
A^c          n21                                  n22    N2.
             N.1                                  N.2    N..

where N = N.., M = N1., K = N.1 and x = n11. Using this notation, alternatives include:
o Test for equality of two proportions
o Fisher’s Exact Test
21
21 Fisher’s Exact Test This test considers the marginal totals fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table. One can enumerate all possible combinations of n11, n12, n21, n22 that are compatible with those totals. The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
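The procedure described above can be sketched directly: fix the margins, enumerate every admissible value of n11, and add up the probabilities that do not exceed the observed one. The function names are illustrative; exact fractions avoid floating-point ties when comparing probabilities.

```python
from fractions import Fraction
from math import comb


def table_probability(n11, row1, col1, total):
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    return Fraction(comb(row1, n11) * comb(total - row1, col1 - n11),
                    comb(total, col1))


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher p_value: sum of table probabilities <= the observed one."""
    row1 = n11 + n12           # N1. in the slide notation
    col1 = n11 + n21           # N.1
    total = n11 + n12 + n21 + n22
    observed = table_probability(n11, row1, col1, total)
    lowest = max(0, row1 + col1 - total)   # smallest n11 the margins allow
    highest = min(row1, col1)              # largest n11 the margins allow
    probabilities = [table_probability(k, row1, col1, total)
                     for k in range(lowest, highest + 1)]
    return sum((p for p in probabilities if p <= observed), Fraction(0))


# Small illustrative table (made-up counts).
p = fisher_exact_two_sided(1, 9, 11, 3)
print(float(p))
```

Once the margins are fixed, n11 determines the other three cells, which is why a single loop over n11 covers all possible tables.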
22
22 Correction for Multiple Tests As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance: o Methods controlling the False Discovery Rate (FDR): Benjamini and Hochberg (assuming independence), Benjamini and Yekutieli (dropping independence). o Methods controlling the Family-Wise Error Rate (FWER): Holm correction, Westfall and Young.
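As an illustration, two of the listed corrections (Benjamini and Hochberg for the FDR, Holm for the FWER) can be written as adjusted p_values in plain Python. This is a sketch with made-up input values; the Benjamini and Yekutieli and the Westfall and Young procedures are not covered here.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p_values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p_value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm(p_values):
    """Holm adjusted p_values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p_value up, enforcing monotonicity.
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # made-up p_values, one per GO term
print(benjamini_hochberg(raw))
print(holm(raw))
```

A GO term is then reported as significant when its adjusted p_value falls below the chosen threshold.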
23
23 Example
N = 9177 genes on microarray
M = 467 in GO category A
N-M = 8710 in A^c
K = 173 genes picked randomly
x = 51 genes of category A
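The arithmetic behind this example can be checked with a short sketch. It fills in the 2x2 table of the "Alternative approaches" slide from N, M, K and x, and compares the observed count with the count expected under the null hypothesis; the variable names are illustrative.

```python
# Values from the Example slide.
N, M, K, x = 9177, 467, 173, 51

# Cells of the 2x2 table (rows: A, A^c; columns: selected, not selected).
n11 = x                  # selected and in category A
n12 = M - x              # in category A but not selected
n21 = K - x              # selected but not in category A
n22 = N - M - K + x      # neither selected nor in category A
table = [[n11, n12], [n21, n22]]

# Under the null hypothesis, the selected genes contain category A
# in the same proportion as the whole array.
expected_x = K * M / N
proportion_on_array = M / N
proportion_in_selection = x / K

print(table)
print(round(expected_x, 2), round(proportion_on_array, 4),
      round(proportion_in_selection, 4))
```

About 5% of the genes on the array belong to category A, against roughly 29% of the selected genes, so the observed 51 is several times the expected count.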