Extracting Biological Information from Gene Lists

1 Extracting Biological Information from Gene Lists
Simon Andrews, Laura Biggins, Boo Virk v1.0

2 Analysis doesn’t end here!
Analysis of processed sample:
Biological material → Sample for analysis → Sample processing (isolation of DNA, RNA or proteins) → Data acquisition (sequencing, microarray analysis, mass spectrometry) → Raw data file(s) → Data analysis (identification of genes, transcripts or proteins, using public databases) → Results table containing hits (genes, transcripts or proteins)
Analysis doesn’t end here!

3 Why functional analysis?
Advantages:
• Biological insight
• Validation of experiment
• Generate new hypotheses
Limitations:
• Amount of information depends on the species
• Will only find known/published links between genes
• If working on something novel, the information available may be limited

4 What this course covers
Morning:
• Introduction to Gene Lists
• Gene List Practical
• Coffee
• Presenting Results
• Presenting Results Practical
Afternoon:
• Motif Searching
• Motif Searching Practical
• Coffee
• Networks and Interactions
• Network Practical
• Commercial tools

5 Gene Lists
Types of gene list:
• Names of genes
• Names of genes ordered by qualitative value
• Names of genes ordered by quantitative value
Gene lists can be ranked:
• P-value
• Other statistic
• Ordered
Gene lists can be filtered:
• Cut-off point
• Subset of genes

6 Transforming Gene IDs
You need to use the relevant ID to extract information from databases. The BioMart ID conversion tool allows us to do this easily and quickly online.

7

8

9 Download this data, import transformed IDs into table
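The import step can be sketched in a few lines of Python. This is only an illustration: the export text, its column headers ("Gene name", "UniProtKB ID") and the `results` rows are assumptions standing in for whatever BioMart actually gives you, and `XYZ1` is a made-up symbol used to show an unmapped gene.

```python
import csv
import io

# Hypothetical tab-delimited BioMart export; the header names are assumptions.
biomart_export = """Gene name\tUniProtKB ID
TP53\tP04637
CDK1\tP06493
POLE\tQ07864
"""

# Hypothetical results table keyed by gene symbol.
results = [
    {"gene": "TP53", "score": 125527},
    {"gene": "CDK1", "score": 113740},
    {"gene": "XYZ1", "score": 1},  # made-up symbol with no mapping
]


def load_mapping(text):
    """Read the export into a {gene symbol: UniProt accession} dictionary."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Gene name"]: row["UniProtKB ID"] for row in reader}


def add_ids(rows, mapping):
    """Return copies of the rows with a 'uniprot' column (None if unmapped)."""
    return [dict(row, uniprot=mapping.get(row["gene"])) for row in rows]


mapping = load_mapping(biomart_export)
annotated = add_ids(results, mapping)
for row in annotated:
    print(row)
```

Rows whose ID could not be converted keep `None`, which makes unmapped genes easy to spot before moving on.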

10 I have my gene list, what next?
Hyperlinked table:
Gene    UniProt  Name                                   Score   Reactome
TP53    P04637   Tumor Suppressor p53                   125527
CDK1    P06493   Cyclin-dependent kinase 1              113740
POLE    Q07864   DNA Polymerase Epsilon                 107190
KPNB1   Q14974   Importin subunit beta-1                35542
CHEK1   O14757   Serine/threonine-protein kinase Chk1   35271
AURKB   Q96GD4   Aurora kinase B                        30803
RPA2    P15927   Replication protein A 32 kDa subunit   22207
CDT1    Q9H211   DNA replication factor Cdt1            21735
MCMBP   Q9BTE3   MCM complex-binding protein            17811
TUBG1   P23258   Tubulin gamma-1 chain                  16895
RAN     P62826   GTP-binding nuclear protein Ran        16384
RANGRF  Q9HD47   Ran guanine nucleotide factor Mog1     15527
BLM     P54132   Bloom syndrome protein                 14883
PCNA    P12004   Proliferating Cell Nuclear Antigen     13982
SETD8   Q9NQR1   Pr-Set7                                13711
RCC1    P18754   Regulator of chromosome condensation   13302
MCM5    P33992   DNA replication licensing factor MCM5  12806
CDC25C  P30307   M-phase inducer phosphatase 3          12510
PLK1    P53350   Serine/threonine-protein kinase PLK1   10930
MZT1    Q08AG7   Mitotic-spindle organizing protein 1   9210
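A hyperlinked table like this can be generated rather than typed by hand. The sketch below builds HTML rows from a few of the entries above; the UniProt URL pattern is an assumption (check the current link format for each resource you use), and the function names are illustrative only.

```python
from html import escape

# Assumed URL pattern for UniProt entries; verify before relying on it.
UNIPROT_URL = "https://www.uniprot.org/uniprotkb/{accession}"

# A few rows taken from the table: (gene, UniProt accession, name, score).
genes = [
    ("TP53", "P04637", "Tumor Suppressor p53", 125527),
    ("CDK1", "P06493", "Cyclin-dependent kinase 1", 113740),
    ("PLK1", "P53350", "Serine/threonine-protein kinase PLK1", 10930),
]


def html_row(gene, accession, name, score):
    """Build one table row with the accession linked to its UniProt page."""
    url = UNIPROT_URL.format(accession=accession)
    link = f'<a href="{escape(url)}">{escape(accession)}</a>'
    return (
        f"<tr><td>{escape(gene)}</td><td>{link}</td>"
        f"<td>{escape(name)}</td><td>{score}</td></tr>"
    )


def html_table(rows):
    """Wrap the rows in a minimal HTML table with a header line."""
    header = "<tr><th>Gene</th><th>UniProt</th><th>Name</th><th>Score</th></tr>"
    body = "\n".join(html_row(*row) for row in rows)
    return f"<table>\n{header}\n{body}\n</table>"


table = html_table(genes)
print(table)
```

The same pattern extends to other resources by adding one URL template per column.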

11 Hyperlinked tables
Advantages:
• Easy to create, no special data-mining software needed
• One-click direct access to relevant pages
• Reference resource
However…
• Need to become familiar with the resources available: tailor hyperlinks to be specific for your organism and the questions being asked
• Information on one gene at a time in your gene set
• Need to get relevant resource IDs

12 I have my gene list, what next?
Annotated table:
Gene    UniProt  Name                                   Score
        PANTHER GO-Slim BP
CHEK1   O14757   Serine/threonine-protein kinase Chk1   35271
        apoptotic process; nitrogen compound metabolic process; biosynthetic process; transcription from RNA polymerase II promoter; cellular protein modification process; cell cycle; cell communication; apoptotic process; response to stress; response to abiotic stimulus; regulation of transcription from RNA polymerase II promoter; regulation of cell cycle; chromatin organization
MCM5    P33992   DNA replication licensing factor MCM5  12806
        cell cycle; cell communication
RCC1    P18754   Regulator of chromosome condensation   13302
        cellular component movement; mitosis; chromosome segregation; cellular component morphogenesis; intracellular protein transport; cellular component organization
CDC25C  P30307   M-phase inducer phosphatase 3          12510
        DNA replication; cell cycle
PLK1    P53350   Serine/threonine-protein kinase PLK1   10930
        DNA replication; DNA repair; DNA recombination; cell cycle
TP53    P04637   Tumor Suppressor p53                   125527
        glycogen metabolic process; protein phosphorylation; mitosis; cell communication
CDK1    P06493   Cyclin-dependent kinase 1              113740
        nitrogen compound metabolic process; biosynthetic process; DNA replication; RNA metabolic process; cellular process; regulation of biological process; regulation of catalytic activity
BLM     P54132   Bloom syndrome protein                 14883
        nucleobase-containing compound metabolic process; cell cycle; cell communication; RNA localization; intracellular protein transport; nuclear transport
RPA2    P15927   Replication protein A 32 kDa subunit   22207
        nucleobase-containing compound metabolic process; mitosis; nucleobase-containing compound transport; regulation of catalytic activity
TUBG1   P23258   Tubulin gamma-1 chain                  16895
        phosphate-containing compound metabolic process; cellular protein modification process; cell cycle
KPNB1   Q14974   Importin subunit beta-1                35542
        phosphate-containing compound metabolic process; protein phosphorylation; cytokinesis; cell cycle; regulation of cell cycle; chromatin organization; cytoskeleton organization
MZT1    Q08AG7   Mitotic-spindle organizing protein 1   9210
        protein targeting; nuclear transport
PCNA    P12004   Proliferating Cell Nuclear Antigen     13982
RAN     P62826   GTP-binding nuclear protein Ran        16384
POLE    Q07864   DNA Polymerase Epsilon                 107190
AURKB   Q96GD4   Aurora kinase B                        30803
MCMBP   Q9BTE3   MCM complex-binding protein            17811
CDT1    Q9H211   DNA replication factor Cdt1            21735
RANGRF  Q9HD47   Ran guanine nucleotide factor Mog1     15527
SETD8   Q9NQR1   Pr-Set7                                13711

13 Annotation Sources
There are many databases for annotation sources, including (but not limited to):
• Gene Ontology (GO) (most popular)
• Pathways and Interactions
• Protein domains
• Co-expression
• Transcription binding sites

15 What is Gene Ontology (GO)?
A collaborative effort addressing the need for consistent descriptions of gene products across different databases.
The GO project has three structured ontologies describing gene products independent of species:
• Biological Processes (BP)
• Cellular Components (CC)
• Molecular Functions (MF)

16 GO Structure
3 GO domains (1, 2, 3), each with a root ontology term.
Terms run from general (root, parent) to specific (child).

17 Subsets of GO terms
GO slim terms:
• Cut-down versions of the GO ontologies that contain a subset of terms from the GO resource
• Give a broad overview of the ontology content without the detail of the specific, fine-grained terms
GO fat terms:
• Subset comprising more specific terms

19 Pathway and Interactions
• Are specific pathways enriched in my list?
• What other genes are in this pathway?
• Which genes/gene products interact with my genes of interest?
Databases include:

21 Protein Domain
• Can I find shared protein domains?
• What is the function of the shared domain?
• Which other proteins share this domain?
Databases include:

23 Co-expression
• Which genes are co-expressed?
• Automatic grouping of genes (rather than human curation, as in GO)
Databases include:

25 Annotated tables
Advantages:
• Information on function from larger gene sets
• Sort groups of genes (GO term, pathway, protein domain)
• Relatively easy to create
• Reference resource
However…
• Need to become familiar with the resources specific to your research
• Lots of information can be difficult to sort efficiently
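Sorting genes into groups by annotation is mostly a matter of inverting the table. A minimal sketch, using four rows of the annotated table and assuming the GO-slim terms arrive as one semicolon-separated string per gene (the function name is illustrative):

```python
from collections import defaultdict

# Gene -> semicolon-separated PANTHER GO-Slim BP terms (rows from the table).
annotations = {
    "MCM5": "cell cycle;cell communication",
    "CDC25C": "DNA replication;cell cycle",
    "PLK1": "DNA replication;DNA repair;DNA recombination;cell cycle",
    "MZT1": "protein targeting;nuclear transport",
}


def genes_by_term(gene_terms):
    """Invert gene -> terms into term -> sorted list of genes."""
    groups = defaultdict(set)
    for gene, terms in gene_terms.items():
        for term in terms.split(";"):
            term = term.strip()
            if term:  # skip empty fields left by trailing separators
                groups[term].add(gene)
    return {term: sorted(members) for term, members in groups.items()}


groups = genes_by_term(annotations)
# List the largest groups first.
for term, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{term}: {', '.join(members)}")
```

Group sizes alone say nothing about enrichment; they only make a large annotated table easier to read.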

26 What does functional information tell me?
• Have functional information about a gene set
• Can verify that genes implicated in the experiment are functionally relevant, and discover unexpected shared functions
• Determine which functions are enriched in the gene set
How? Compare to a background list of genes

27 What is a background list?
In theory, any gene that could have been differentially expressed in your experiment:
• RNA-seq – all genes apart from those with fewer than a minimum number of reads
• Arrays – all genes on the array
• ChIP-seq – any gene on the chip

28 Choosing a Background List
Which background list to use?
• Whole set of genes
• Tissue/cell-specific genes
• Manually made list, derived from your experiment and analysis (e.g. based on expression level)
The choice of background list is the place where things are most likely to go wrong. Think about the experimental design and about making the most appropriate comparison.

29 Statistics to test for enrichment
Background: 13,101 genes on chip, of which 3005 are related to disease: 3005/13,101 = 22.9%
Gene list (747 genes):
• Related to disease: 260/747 = 34.8%
• Not related to disease: 487/747 = 65.2%
Are these proportions the same?

30 Lots of Statistical tests to choose from
• Hypergeometric test • Fisher’s exact/Chi-squared • Binomial • Kolmogorov Smirnov • Permutation

31 Hypergeometric test
Uses the hypergeometric distribution to measure the probability of having drawn a specific number of successes (out of a total number of draws) from a population.
Example: Imagine that there are 4 green and 16 red marbles in a box. You close your eyes and draw 5 marbles without replacement. What is the probability that exactly 2 of the 5 are green?
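The marble question can be answered directly from the hypergeometric formula, P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n). The short sketch below uses only the Python standard library; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes) when drawing n items without replacement
    from a population of N items that contains K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# 20 marbles (4 green, 16 red), draw 5, ask for exactly 2 green.
p_two_green = hypergeom_pmf(k=2, N=20, K=4, n=5)
print(round(p_two_green, 4))  # 0.2167
```

Here C(4, 2) = 6, C(16, 3) = 560 and C(20, 5) = 15504, so the probability is 3360/15504, roughly 21.7%.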

32 Are these proportions the same?
Background: 13,101 genes on chip, of which 3005 map to disease: 3005/13,101 = 22.9%
Gene list (747 genes):
• Map to disease: 260/747 = 34.8%
• Do not map to disease: 487/747 = 65.2%
Are these proportions the same?
What is the probability (p-value) that 260 or more genes (out of 747) map to disease, given that there are 3005 of those genes in the background (13,101 genes)?
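As a rough sketch of how this p-value could be computed, the code below sums the upper tail of the hypergeometric distribution, i.e. the probability of seeing 260 or more disease genes in a list of 747. Using the upper tail for over-representation is an assumption about how the test is applied here. Exact integer arithmetic is used so that very small probabilities are not lost to rounding.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k): probability of at least k annotated genes in a list of n
    genes drawn from a background of N genes, K of which are annotated."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    # Divide as exact integers first, then convert to float once.
    return float(Fraction(hits, total))


# Numbers from the slide: 13,101 background genes, 3005 disease genes,
# gene list of 747 with 260 disease genes.
p_value = hypergeom_tail(k=260, N=13101, K=3005, n=747)
print(f"p = {p_value:.3g}")
```

With about 171 disease genes expected by chance (747 × 22.9%), observing 260 gives a very small p-value, so the two proportions are unlikely to be the same.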

33 Hypergeometric test
Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/Ranked list  Large (5% of background)  P-value  Finite population – probability of success changes
Limitations:
• Assumes independence of categories
• Result terms often include directly related terms. Is there really evidence for both terms?
• Works better with larger samples (5% of background)

34 Based on Hypergeometric test:
Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/Ranked list  Large (5% of background)  P-value  Finite population – probability of success changes
Based on the hypergeometric test:
Test            Input                 Sample size               Output   Specifics
Fisher’s Exact  Unranked/Ranked list  Small                     P-value  Can be used to compare 2 conditions as well as gene list to background; one-tailed or two-tailed
Binomial                              Large                              Does not assume finite population – probability of success remains the same
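The difference between a finite and an effectively infinite population can be seen by running both tests on the same numbers. The sketch below is illustrative only: it reuses the disease example (260 of 747 genes, 3005 of 13,101 in the background), computes upper-tail probabilities in log space with the standard library, and treats the background proportion as the fixed success probability for the binomial test.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def binomial_tail(k, n, p):
    """P(X >= k) for n independent draws with fixed success probability p."""
    return sum(
        exp(log_comb(n, i) + i * log(p) + (n - i) * log(1 - p))
        for i in range(k, n + 1)
    )


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which are annotated (the success probability changes per draw)."""
    log_total = log_comb(N, n)
    return sum(
        exp(log_comb(K, i) + log_comb(N - K, n - i) - log_total)
        for i in range(k, min(K, n) + 1)
    )


N, K, n, k = 13101, 3005, 747, 260
hyper_p = hypergeom_tail(k, N, K, n)
binom_p = binomial_tail(k, n, K / N)
print(f"hypergeometric p = {hyper_p:.3g}")
print(f"binomial p       = {binom_p:.3g}")
```

When the gene list is a small fraction of the background the two results are close; the gap widens as the list grows relative to the background.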

35 Limitations of Fisher’s Exact and Binomial test
• Neither accounts for variation in the number of genes annotated to individual terms/functions being tested, or in the number of terms/functions associated with individual genes
• Therefore, they tend to over-estimate significance if the gene set has an unusually high number of annotations
• Assume independence of categories

36 Lots of Statistical tests to choose from
• Hypergeometric
• Fisher’s exact/Chi-squared
• Binomial
• Kolmogorov Smirnov
• Permutation
Used for ranked gene lists only
Output: enrichment scores (ES) for functions, which can then be translated into a p-value

37 Multiple testing correction
Error types in statistics:
Statistical decision   True state in gene list
                       Not overrepresented             Overrepresented
Significant            Type I error (False Positive)   Correct
Not significant        Correct                         Type II error (False Negative)
Traditionally, a test or a difference is said to be “significant” if the probability of a type I error is α ≤ 0.05

38 Probability of error increases from 5% to 14.3%
Example: You want to compare 3 groups and you carry out 3 hypergeometric tests, each with a 5% level of significance (P<0.05).
Probability of not making a type I error in one test = 1 – 0.05 = 0.95 (95%)
Overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857
Therefore the probability of at least one type I error is: 1 – 0.857 = 0.143, or 14.3%
If comparing 5 groups instead of 3 (10 pairwise tests), the multiple testing error rate is 40%! (= 1 – (0.95)^n, with n = 10)
Solution for multiple comparisons: multiple testing correction
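The arithmetic above generalises to any number of tests. A minimal sketch, assuming the tests are independent and that groups are compared pairwise (the function name is illustrative):

```python
from math import comb


def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 3 groups compared pairwise -> 3 tests; 5 groups -> C(5, 2) = 10 tests.
for groups in (3, 5):
    n_tests = comb(groups, 2)
    rate = family_wise_error_rate(n_tests)
    print(f"{groups} groups, {n_tests} tests: {rate:.1%}")
```

Enrichment analyses typically test hundreds of terms at once, so without correction at least one false positive is almost guaranteed.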

39 Multiple test corrections
Bonferroni:
• Significance level (e.g. 0.05) / number of tests = new threshold
• This is an over-correction if tests are correlated
Benjamini-Hochberg:
• Rank the p-values
• Apply the most stringent correction to the most significant p-values, and the least stringent to the least significant
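Both corrections are short enough to write out. The sketch below returns adjusted p-values (which can be compared directly against 0.05) rather than a new threshold; the input p-values are made up for illustration, and the function names are not from any particular tool.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The i-th smallest p-value is multiplied by m / i, so the most significant
    value gets the largest multiplier; a running minimum from the largest
    p-value downwards keeps the adjusted values monotonic.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four terms
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print("Bonferroni:        ", [round(p, 3) for p in bonf])
print("Benjamini-Hochberg:", [round(p, 3) for p in bh])
```

In this toy example Bonferroni leaves two terms below 0.05, while Benjamini-Hochberg leaves all four, which illustrates how much less conservative it is.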

40 Statistical issues
• We want to identify functions of maximal biological significance, BUT this is not perfectly correlated with statistical significance
• Use p-values as a tool to rank functions, but don’t take them too literally
• Need to correct for multiple testing

41 Tools for functional gene list analysis
There are many different tools available, both free and commercial Popular web-based tools include:

42 PANTHER (Protein ANnotation THrough Evolutionary Relationship) http://www.pantherdb.org/
One of the most widely used online resources for gene function classification and genome-wide data analysis.
PANTHER users have successfully analysed data from:
• Gene expression
• Proteomics
• Genome-wide association study (GWAS) experiments
PANTHER is part of the GO consortium, thus PANTHER annotation = up-to-date GO curation

43 PANTHER for functional classification

44

45

46

47

48 Send list to > File
Saves the table in a tab-delimited .txt file

49 PANTHER for statistics

50

51 Annotations from PANTHER include:
• GO-slim terms
• PANTHER “protein class”
• PANTHER “Pathway” terms
Doesn’t cluster together genes with similar GO terms in the table
Statistics: Binomial test with Bonferroni multiple testing correction

52 DAVID https://david.ncifcrf.gov/
• Gathers data from many different databases (this is customisable)
• Functional Clustering
• Uses many annotations, including GO-Fat terms (a more specific set of GO terms)
Statistics: Fisher’s Exact Test and multiple testing correction

53 DAVID for functional classification

54

55

56

57

58

59

60

61

62 Functional Clustering
The enrichment score is for the whole cluster rather than individual functions. In DAVID, anything above 2 or 3 is considered enriched.

63 Which DAVID tool should I use?

64 GOrilla http://cbl-gorilla.cs.technion.ac.il/

65 Which tool to use?
Choose a tool that:
– Includes your gene / probe identifiers
– Includes your species
– Has up-to-date annotation
– Lets you define your background (if possible)
– Try a few different tools
– Try gene lists of varying length

