Fission Yeast Computing Workshop -1- Searching, querying, browsing, downloading and analysing data using GeneDB and the Gene Ontology annotation


1 Fission Yeast Computing Workshop -1- Searching, querying, browsing, downloading and analysing data using GeneDB and the Gene Ontology annotation
Basic searching and browsing
Anatomy of the GeneDB gene page (overview of page contents)
Simple data mining and analysis
Create user-defined gene sets and download gene sets in various formats
Combine (union, intersect and subtract) to make and refine user-defined lists
“GO slimming” and “GO enrichment” exercises

2 Fission Yeast Computing Workshop -2- Basic search/browse tips
1. This searches ONLY the gene name and product line.
2. This searches the full text of the page. It is advisable to use quotes for compound terms, e.g. “mitotic cyclin”, because mitotic cyclin without quotes will search for “mitotic AND cyclin”. Similarly, PMID:19250904 will not work but “PMID:19250904” will. This isn’t a good way to retrieve gene sets (we will look at better ways), although it is useful for quickly getting to a single gene page. You can also use this search to quickly locate the pombe ortholog of a cerevisiae gene, but you need the systematic ID, e.g. YPR070W.
3. Register gene names pre-publication here.
4. Mailing lists (pombelist) and curated S. cerevisiae orthologs.
5. Browse catalogues.

3 Fission Yeast Computing Workshop -3- Anatomy of a gene page
Location
1. Chromosome, coordinates
2. Context map
3. GBrowse
4. Artemis (EMBL format or Artemis applet)
General information
Gene names
Product (unique)
Access to protein and DNA sequence
Access to various BLAST searches

4 Fission Yeast Computing Workshop -4- Curation
Includes:
Viability (if available; will soon be genome-wide)
Species distribution
Phenotype (new, not comprehensive)
Name derivations
Disease associations
Post-translational modifications
S. cerevisiae orthologs
Domain and family information (but only when there are more members than identified by Pfam)
Targets
Information about expression and regulation
Protein feature information (coiled coil, cleavage site)
By using a controlled vocabulary we can group “like” features. Eventually these will be captured by more formal ontologies (phenotypes -> PATO).
Curation “terms” are listed and grouped in the Curation browsable list (e.g. below).
Curation terms can be used in the Boolean query tool (later exercise).

5 Fission Yeast Computing Workshop -5- Gene Ontology (GO) Annotation
We now have good breadth of annotation, especially to high-level terms (demonstrated in a later practical); depth (specificity) could be improved.
Annotations are supported by an evidence code and a source (publication). Sometimes a qualifier is used to provide extra information about an annotation.
A gene product annotated to a term is automatically annotated to all of that term’s parents, so if you want to find other genes which might be related to this process you can go up the graph.
The GO term on the gene page is linked to the AmiGO GO browser “term page”. The term information has a definition and synonyms. Scroll down the page for the term lineage graph. This shows parents; for example, you may wish to go up the graph (or tree view) to access the 42 gene products annotated to the parent “spindle organisation”, which will also include these 4 genes.
NOTE: “spindle organisation” does not appear on the mto1 gene page (even though mto1 is annotated to this term), so the “full text search” would not necessarily retrieve all genes annotated to a GO term.
The complete list of genes annotated to a term can be accessed:
i) Through AmiGO (by going to the term page and accessing the product list)
ii) From a page with a direct annotation
iii) Through the Boolean query interface (later)
From GeneDB this view is filtered to show only pombe annotations, but you can change to other species filters to access the results in other organisms.
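The rule that an annotation to a term also counts for every parent of that term can be sketched in a few lines of Python. This is only an illustration: the three-term ontology, the term "spindle assembly" as a child, and the gene names geneA and geneB are all made up for the example and are not taken from GO or GeneDB.

```python
# Toy ontology: term -> list of direct parents (hypothetical, not real GO structure).
PARENTS = {
    "spindle assembly": ["spindle organisation"],
    "spindle organisation": ["microtubule cytoskeleton organisation"],
    "microtubule cytoskeleton organisation": ["biological process"],
    "biological process": [],
}

# Direct annotations: gene -> terms shown on its gene page (illustrative only).
DIRECT = {
    "geneA": {"spindle assembly"},
    "geneB": {"spindle organisation"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by walking up from `term` (excluding itself)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_annotated_to(term, direct=DIRECT):
    """Genes annotated to `term` directly or through any of its child terms."""
    result = set()
    for gene, terms in direct.items():
        if any(t == term or term in ancestors(t) for t in terms):
            result.add(gene)
    return result


# geneA is only directly annotated to the child term, yet it is returned for the parent.
print(sorted(genes_annotated_to("spindle organisation")))
```

In this toy case a text search of the geneA page for "spindle organisation" would find nothing, because only the direct term appears on the page, while the propagated query returns both genes.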

6 Fission Yeast Computing Workshop -6- External Links: http://128.40.79.33
Pfam
Pfam is a database of protein domains and protein families, each represented by non-overlapping multiple sequence alignments. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. These cover a large proportion of sequences in the sequence databases (83% for pombe, the highest coverage of any eukaryote). In order to give further coverage, these are supplemented by automatically generated entries called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions with no Pfam-A entries.

7 Fission Yeast Computing Workshop -7- AmiGO is the official browser of the GO consortium.
It allows users to:
Browse the GO ontology
Search the GO ontology
View annotations to terms in different species
AmiGO is the GO browser we use for GeneDB. This is a separate installation of AmiGO from the one on the GO site. An important difference is that the GeneDB implementation includes IEA annotations and will give better coverage (although the GO site should also soon support IEAs in AmiGO). You can tell which version you are using from the URL.
Go to the GeneDB version: http://www.genedb.org/amigo-cgi/search.cgi?
Simple searching and browsing of GO with AmiGO
You can search for gene names or identifiers OR GO terms.
Search “GO terms” for “DNA repair”. The most relevant results should be near the top of your results.
Click on the term name to take you to the “term details” page.

8 Fission Yeast Computing Workshop -8- You can access broader and narrower terms (move up and down the tree) from the term lineage. Broader terms can be useful to identify lists of related genes.
Scroll down the page to see the term lineage and the numbers of genes annotated to this term in ALL organisms.
Filter on “data source GeneDB S. pombe” to retrieve only fission yeast annotations (now 152).
Links to the associations (gene product annotations).
Note that you can also access lists of annotations to a term (both direct and indirect) from the gene page of any annotated gene product in GeneDB, provided there is at least one direct annotation to this term in the genome.

9 Fission Yeast Computing Workshop -9- Return to the gene product search (front page) using your “back button”.
Search for a fission yeast gene. This search will return any gene products where the gene product name matches the search term. If you still have the filter on, you will retrieve only pombe genes which match this name. You can disable the filter under the search option.
From the “Gene product search” you can access all GO annotations to individual gene products.

10 Fission Yeast Computing Workshop -10- Browsing
From wherever you are in AmiGO, click “Browse” in the menu bar to take you to this view.
Key to the tree icons:
Is_a relationship
Part_of relationship
Leaf node (no children)
Node has been opened, can be clicked to close
Node has children, can be clicked to view children
Browse the high-level biological process terms by opening the nodes (“+”) for “biological process” and “cellular process”.
Set the filter for data source GeneDB_Spombe; you will notice that almost all fission yeast annotated gene products are annotated to “cellular process”.
Browsing can be used to identify sets of genes of interest, to locate a term if you can’t find it by searching, or to see how high-level terms relate to each other (as when building a slim later).

11 Fission Yeast Computing Workshop -11- Using the “Boolean query interface” to select and download some user-defined gene sets
http://www.genedb.org/genedb/pombe
The Boolean query interface entry point is on the S. pombe GeneDB front page.
You can construct queries (AND (intersect) / OR (union)) directly in this interface, but it is much simpler to perform single queries and combine them in the query history. In addition, you can subtract queries from each other in the history, but you can’t in the query builder.
Exercise 1: Download a protein set
First, a complete list of protein coding genes, their identifiers and products.
Select “genes of a certain type” from the pull-down menu and “proceed to next step”.
Select “protein coding” and “submit” from the next view.
The results page will provide you with a list of all protein coding genes, which you can “page through” at 20 items per page.
The link “visit the history page” takes you to your query history, from where you can refine queries and download results sets in various formats.

12 Fission Yeast Computing Workshop -12- The total of 5025 is slightly higher than the actual number of protein coding genes. This is because transposons contain a protein coding open reading frame and are annotated as CDS (coding sequence).
Use the “back button” on your browser to return to the “Boolean query selector interface”.
Select the data type “annotation status”.
Select the data type “transposable element” and “Submit”.
Go to the history page; you will now have the results of both queries in your query history manager.
Select (using the checkboxes) both results sets and subtract query 2 from query 1 to give the current set of protein coding genes.
Note: to do a subtraction query, the query you wish to subtract must be below the query you want to subtract from, so in this case you need to make sure that you perform the queries in the correct order. The order does not matter for intersection and union queries.
Click the link “Download” next to your final results set.
Tip: At this page you can also supply your own lists to perform the following download operations.
Query history
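Why the order matters for subtraction but not for union or intersection can be seen with ordinary Python sets. The gene identifiers below are invented placeholders, not real GeneDB IDs, and the sketch only mirrors the logic of the query history, not its implementation.

```python
# Query 1: everything annotated as protein coding (made-up IDs, transposons included).
protein_coding = {"gene1", "gene2", "gene3", "te1", "te2"}
# Query 2: transposable elements.
transposons = {"te1", "te2"}

# Union and intersection give the same answer whichever query comes first.
assert protein_coding | transposons == transposons | protein_coding
assert protein_coding & transposons == transposons & protein_coding

# Subtraction does not: query 1 minus query 2 is the set we want...
q1_minus_q2 = protein_coding - transposons
# ...while query 2 minus query 1 is empty here.
q2_minus_q1 = transposons - protein_coding

print(sorted(q1_minus_q2))
print(sorted(q2_minus_q1))
```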

13 Fission Yeast Computing Workshop -13- Download Options
Scroll down the page to see the download options. We want a tab-delimited file with sequence ID and product. Tab-delimited is the default and ID should be pre-selected, so you should only need to select “product”. The sequence options are for FASTA files only and are not applicable to the tab-delimited file.
Submit the query with output destination “normal page”. The output will be a list of IDs and products in tab-delimited format.
Go back and change the output destination to “Save as”. This will allow you to save to disk. The file will download with the name IdListFormHandler, so you will need to rename it to something sensible. We will use this file later.
Tip: From this interface, for any results set or user-defined list, you can also download:
i) FASTA format protein sequence file
ii) FASTA format DNA sequence file
iii) 5’ or 3’ DNA sequence of user-specified length for each CDS
iv) Each CDS with the 5’ AND 3’ regions of specified length
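If you want to reuse the downloaded file in your own scripts, a two-column tab-delimited file is easy to read with Python's standard csv module. The sketch below assumes the layout described above (sequence ID, then product, one gene per line); the sample rows are placeholders rather than real GeneDB output.

```python
import csv
import io

# Placeholder content standing in for the downloaded file (not real GeneDB data).
SAMPLE = "ID_0001\thypothetical product one\nID_0002\thypothetical product two\n\n"


def read_id_product(handle):
    """Return {sequence ID: product} from a two-column tab-delimited file."""
    table = {}
    # QUOTE_NONE keeps any quote characters inside product descriptions untouched.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        table[row[0].strip()] = row[1].strip()
    return table


products = read_id_product(io.StringIO(SAMPLE))
print(len(products), "genes read")
```

For a file saved to disk, pass an open file handle instead of the StringIO object, for example one opened with `open("protein_coding.tsv", newline="")`, where the file name is whatever you renamed IdListFormHandler to.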

14 Fission Yeast Computing Workshop -14- Exercise 2: Boolean queries, protein status query
You can recreate this data in the Boolean query interface: select queries for annotation status “conserved hypothetical”, “role inferred from homology”, “experimentally characterised”, “sequence orphan” and “S. pombe specific families”. Go to the query history and union these datasets. Although the numbers may have changed slightly, the total should be close to 4947. Why is this different from the protein coding total in the previous exercise? Subtract query 2 from query 1 to see what the difference is. You may wish to exclude these “genes” from future queries.
You can recreate this data in the Boolean query interface: select queries for “proteins with curation containing a specific word or phrase”, then run queries for the phrases (keywords) “conserved in Metazoa” and “conserved in fungi only”. Intersect both of these queries with the conserved hypothetical query (i.e. conserved unknowns).
If you have time, you can query for genes with the specific GO component “nucleus” and the curation “predominantly uniformly single copy” to get those which are single copy in most organisms.

15 Fission Yeast Computing Workshop -15- What makes a good “slim”?
This depends a lot on what you want your slim to show, but there are some general considerations:
1. If you are trying to make a slim for the entire genome, you should try to ensure that it covers as many annotated terms as possible, but you might want to avoid terms with excessively large or small numbers of annotations (to avoid extreme distributions in your histogram). You should be aware of how many genes are annotated to terms not in your slim, and how many genes are “unknown” (i.e. annotated only to the root node).
2. You may want to keep the number of terms as small as possible to convey your results (for display purposes). However, you still need to include the “biologically relevant” terms. Many terms (e.g. metabolic process (2915 annotations), cellular process (4083 annotations)) are too “general” for the purpose of most “slims”.
4. The slim should probably exclude sibling terms with large overlaps between their annotations. If you choose two siblings with 200 genes annotated to each, and the majority of the annotations overlap, it may be better to select the parent node (i.e. replace the 2 terms by one single term). Conversely, if the child terms of a node fall into distinct non-overlapping subsets, it might be more informative to include both child terms in your slim (for example the term transport, see below).
5. For most purposes you need to include a representative term for all biologically relevant processes, by including terms which are meaningful, especially if you are defining a slim for a specific purpose.
6. If you are using your slim for data analysis (and not just for visualization), you need to include terms which will allow you to distinguish genes based on their biological properties. For example, it is not good to lump all genes involved in transport under transport, because the genes annotated to distinct child terms (vesicle-mediated transport, protein targeting, transmembrane transport) are VERY different in terms of their i) viability ii) species distribution iii) number of interaction partners iv) copy number v) expression pattern, so it may not make sense to lump them together in your slim set. This is important if you are using a slim to display the results of an enrichment, for example.
GO slimming
High-level view of GO (genes annotated to granular terms are mapped to higher-level terms).
Allows users to group genes into broader categories to assess their distribution, for genome-wide analyses or smaller gene sets.
Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high-level processes).
You can create and use your own GO slim with high-level terms of interest.
CARE: the counts are not gene product counts, because gene products have multiple annotations. This means that it doesn’t make sense to display this information as a pie chart.
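The sibling-overlap consideration can be checked numerically once you have the gene lists for two sibling terms. The sketch below is one possible way to do it, not a rule from the workshop: the overlap measure (shared genes divided by the size of the smaller set), the 0.5 threshold and the gene names are all assumptions chosen for illustration.

```python
def overlap_fraction(a, b):
    """Fraction of the smaller gene set that is also present in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def suggest(a, b, threshold=0.5):
    """Suggest the parent term when two siblings mostly share their genes."""
    if overlap_fraction(a, b) > threshold:
        return "use parent term"
    return "keep both child terms"


# Made-up gene sets for three sibling terms.
sibling_1 = {"g1", "g2", "g3", "g4"}
sibling_2 = {"g2", "g3", "g4", "g5"}   # shares 3 of 4 genes with sibling_1
sibling_3 = {"g6", "g7"}               # shares nothing with sibling_1

print(suggest(sibling_1, sibling_2))
print(suggest(sibling_1, sibling_3))
```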

16 Fission Yeast Computing Workshop -16- GO slimming, here’s one I made earlier...
pombe biological process GO slim
You can cut and paste these terms from here: http://www.sanger.ac.uk/Projects/S_pombe/GO_slim
GO:0006810 transport (819)
GO:0055085 transmembrane transport (305)
GO:0006913 nucleocytoplasmic transport (116)
GO:0016192 vesicle-mediated transport (277)
GO:0006605 protein targeting (164)
GO:0006259 DNA metabolic process (310)
GO:0006310 DNA recombination (100)
GO:0006281 DNA repair (155)
GO:0006260 DNA replication (154)
GO:0006486 protein amino acid glycosylation (68)
GO:0030163 protein catabolic process (229)
GO:0006412 translation (594 includes RNA)
GO:0006457 protein folding (86)
GO:0032446 protein modification by small protein conjugation or removal (155)
GO:0016070 RNA metabolic process (914)
GO:0006399 tRNA metabolic process (127)
GO:0016071 mRNA metabolic process (214)
GO:0032569 transcription (447)
GO:0032569 specific transcription from RNA polymerase II promoter (139)
GO:0006996 organelle organization (791)
GO:0007005 mitochondrion organization (230)
GO:0042254 ribosome biogenesis (232)
GO:0007165 signal transduction (386)
GO:0000747 conjugation with cellular fusion (106)
GO:0030437 ascospore formation (96)
GO:0007010 cytoskeleton organization (215)
GO:0006950 response to stress (694)
GO:0051186 cofactor metabolic process (137)
GO:0006629 lipid metabolic process (203)
GO:0006766 vitamin metabolic process (59)
GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (131)
GO:0005975 carbohydrate metabolic process (226)
GO:0006725 cellular nitrogen compound metabolic process (202)
GO:0006091 generation of precursor metabolites and energy (128)
GO:0006520 amino acid metabolic process (191)
GO:0000910 cytokinesis (141)
GO:0007059 chromosome segregation (189)
GO:0007346 regulation of mitotic cell cycle (162)
GO:0007047 cell wall organization (63)
GO:0042546 cell wall biogenesis (72)
GO:0006461 protein complex assembly (111)
GO:0007126 meiosis (173)
GO:0007163 establishment or maintenance of cell polarity (60)
GO:0019725 cellular homeostasis (101)
GO:0016568 chromatin modification (209)
(On the original slide, children of broader terms are indented.)
Other processes not in the slim (under 100, work in progress)
Process unknown (i.e. annotated only to the root node “biological process”) (897)

17 Fission Yeast Computing Workshop -17- Exercise 3a “GO Slimming”: create a “GO slim”
This exercise uses the generic “GO slim mapper” at Princeton to create a GO slim distribution from our gene set of interest.
Go to http://go.princeton.edu/cgi-bin/GOTermMapper (or Google “Princeton generic GO term mapper”).
1. Upload the protein coding gene list from Exercise 1. Select GeneDB S pombe (Generic GO Slim).
2. User-defined GO slim: in the advanced options, use the pombe GO slim as a starting point and add your own terms of interest.

18 Fission Yeast Computing Workshop -18- Exercise 3b “GO Slimming”: create a GO slim
This exercise uses the GeneDB AmiGO “GO slimmer” to create a GO slim distribution from our gene set of interest.
Go to http://www.genedb.org/amigo-cgi/slimmer (or Google “AmiGO GeneDB GO slimmer”).
1. Upload a gene list from the data mining exercise or the complete gene list. Select “Fission yeast GO slim”.
2. User-defined GO slim: in the advanced options, use the pombe GO slim as a starting point and add your own terms of interest.

19 Fission Yeast Computing Workshop -19- “GO slimming, important considerations”
For most purposes this slim would be inadequate, but it does show “unknown” (unannotated) and “other” (annotated to some term other than those in the slim). AmiGO and the Princeton Term Mapper should show these soon.
There are usually many more annotations than genes (i.e. 8454 here, and this will increase as you add more terms). Many genes are annotated to multiple high-level terms (i.e. there are intersections between many terms). A pie chart does not show the percentage of the genome involved in a particular process, although it is often used and interpreted that way. Histograms with absolute numbers on the axis rather than percentages are much more meaningful.
To research a user-defined slim, to ensure you have good coverage and to check intersections between your chosen terms, use the entire protein set in the Boolean query history and subtract:
Genes with no GO process annotation
Followed by your GO terms of interest (your remainder will be genes annotated to terms which are not covered by your GO slim)
If you are slimming and comparing to the complete annotation, be aware that this includes annotations to tRNAs, rRNAs etc., not just proteins.
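The point that slim counts are annotation counts rather than gene counts can be illustrated with a small tally. In this sketch the two-term slim, the gene names and their annotations are invented, and an empty term set stands in for a gene annotated only to the root node.

```python
from collections import Counter

SLIM = {"transport", "DNA repair"}

# gene -> process terms after propagation to parents (made-up data).
GENE_TERMS = {
    "gene1": {"transport", "DNA repair"},   # counted under both slim terms
    "gene2": {"transport"},
    "gene3": {"lipid metabolic process"},   # annotated, but not covered by this slim
    "gene4": set(),                         # stands for "annotated only to the root"
}


def slim_counts(gene_terms, slim):
    """Count annotations per slim term, plus 'other' and 'unknown' genes."""
    counts = Counter()
    for terms in gene_terms.values():
        hits = terms & slim
        if hits:
            counts.update(hits)       # one count per slim term the gene maps to
        elif terms:
            counts["other"] += 1
        else:
            counts["unknown"] += 1
    return counts


counts = slim_counts(GENE_TERMS, SLIM)
total_annotations = sum(counts[term] for term in SLIM)
genes_in_slim = sum(1 for terms in GENE_TERMS.values() if terms & SLIM)

print(dict(counts))
print(total_annotations, "slim annotations from", genes_in_slim, "genes")
```

Because gene1 is counted under both slim terms, the per-term counts add up to more than the number of genes, which is why the slices of a pie chart built from them would not represent fractions of the genome.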

20 Fission Yeast Computing Workshop -20- Exercise 4: Create a user-defined gene set
Return to the Boolean query interface. The set you create will be used as an input set to the GO “enrichment” tool. Use a combination of searches, but try to make your set contain between 200 and 500 gene products.
Things you can include in your Boolean query are:
Curation (already used)
Genes of a certain type (protein coding etc.; already used)
Annotation status (characterised, orphan etc.)
Specific GO function, process and component
Any GO annotation, or any GO annotation to a specific aspect
A specific Pfam domain, or any Pfam domain
Any range of exon number, molecular mass or protein length
Presence of signal peptides, GPI anchors or transmembrane domains
Note: you can select for the absence of features using a subtraction query.
All GO terms and Pfam domain names are listed alphabetically, so it helps if you know how the term you are looking for is worded before you start.
Remember when you search GO that:
i) A gene product annotated to a term is automatically annotated to ALL of its parents
ii) A search on a GO term returns annotations to ALL children of that term
The list of GO process terms on page 16 may be a useful starting point. If you would like to use more granular terms, you can browse for children of these in AmiGO.
Download your results set to use as input to the enrichment exercise (see Exercise 1 for the download instructions).

21 Fission Yeast Computing Workshop -21- Exercise 5 “GO Term Enrichment”
This exercise uses the generic “GO term finder” tool at Princeton to provide an enrichment analysis (significant shared terms) for a gene set of interest.
Go to http://go.princeton.edu/cgi-bin/GOTermFinder
1. Upload your gene list from Exercise 4.
2. Select the process ontology.
3. Choose the pombe association file (annotations).
The results will show the most significant terms in your gene set, in order of significance. The % in your gene set compared to the % in the genome as a whole is provided, in addition to the P-value.

22 Fission Yeast Computing Workshop -22- Results are provided online as HTML tables and can be downloaded locally. Results are also presented as a DAG, which allows users to browse the results set in the context of the GO hierarchy.
“GO Term Enrichment”, important considerations
In the advanced options there is the option to upload the list of genes for your background population. This is especially important, as the significance needs to be calculated from the set of genes in your experiment, not the genome as a whole. Even if you have used the entire genome in your experiment, you should still upload the gene list in case the gene set has changed. Also, the complete set of annotations includes tRNAs, rRNAs etc. and their GO annotations. If your experiment does not include these, and you do not upload your own lists, your significance for some terms (e.g. translation) could be very distorted.
For other important considerations for enrichment (and slimming), see: Rhee SY, Wood V, Dolinski K, Draghici S. Use and misuse of the gene ontology annotations. Nat Rev Genet. 2008 Jul;9(7):509-15.
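To see why the background population matters, it helps to compute an enrichment P-value by hand. The sketch below uses a hypergeometric tail probability, which is a common choice for this kind of test; it is offered as an illustration of the idea rather than a description of exactly what the Princeton tool computes, it applies no multiple-testing correction, and all the numbers are invented.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated genes in a set of n genes.

    N: genes in the background population
    K: background genes annotated to the term
    n: genes in the set of interest
    k: genes in the set of interest annotated to the term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Same gene set and term, two different (made-up) background sizes.
p_small_background = enrichment_p_value(k=10, n=50, K=60, N=500)
p_large_background = enrichment_p_value(k=10, n=50, K=60, N=5000)

print(f"background of 500:  p = {p_small_background:.3g}")
print(f"background of 5000: p = {p_large_background:.3g}")
```

Holding everything else fixed, the larger background makes the same 10 hits look far more surprising, which is the kind of distortion described above when the background includes genes that were never part of the experiment.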

23 Fission Yeast Computing Workshop -23- Exercise 6: GO coverage query
You can recreate this by doing Boolean queries for the specific GO aspects (F, P, C) and combining them in the query history to generate the overlaps. Or, you can download the 3 gene lists (any Component, any Function and any Process) and import them into the online Venn diagram generator at the URL below.
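If you prefer to work out the overlaps yourself from the three downloaded lists, the seven regions of a three-set Venn diagram follow directly from set operations. The gene names below are placeholders, chosen so that each region contains exactly one gene.

```python
def venn3_regions(component, function, process):
    """Split three gene sets into the seven regions of a Venn diagram."""
    c, f, p = component, function, process
    return {
        "C only": c - f - p,
        "F only": f - c - p,
        "P only": p - c - f,
        "C and F only": (c & f) - p,
        "C and P only": (c & p) - f,
        "F and P only": (f & p) - c,
        "C, F and P": c & f & p,
    }


# Made-up gene lists for "any Component", "any Function" and "any Process".
any_component = {"a", "b", "d", "g"}
any_function = {"b", "c", "d", "e"}
any_process = {"d", "e", "f", "g"}

regions = venn3_regions(any_component, any_function, any_process)
for name, genes in regions.items():
    print(f"{name}: {len(genes)}")
```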

24 Fission Yeast Computing Workshop -24- http://www.sanger.ac.uk/Projects/S_pombe/download.shtml
The contigs or chromosomes in EMBL format are the files you can use to browse the data with the Artemis sequence viewer.
Each FTP directory contains a README file describing the file content and format. Make sure you consult this before downloading the data.

25 Fission Yeast Computing Workshop -25- http://www.sanger.ac.uk/Projects/S_pombe/genome_stats.shtml
These data are regularly updated. Where possible, links are provided to the data described.

