Fission Yeast Computing Workshop -1- Getting the most from the fission yeast genome data: A computing workshop WT Sanger Institute WT Genome Campus Hinxton.

Presentation transcript:

Fission Yeast Computing Workshop -1- Getting the most from the fission yeast genome data: A computing workshop WT Sanger Institute WT Genome Campus Hinxton Cambridge UK March 2006

Fission Yeast Computing Workshop -2- Index In progress

Fission Yeast Computing Workshop -3- Programme
Day One
Introduction: Valerie Wood
Module I: Genome Data Access
  Genome Statistics and Obtaining Data: Martin Aslett
  GeneDB: Martin Aslett
Module II: Gene Ontology
  Gene Ontology Structure: Midori Harris
  Gene Ontology Usage: Jane Lomax
  S. pombe GO annotation: Valerie Wood
Break
Module III: Other Databases
  Pfam: Rob Finn
  UniProt: Viv Junker
  Functional genomics: Jürg Bähler
  GRID 1 & Assessing sequence similarity: Valerie Wood
Dinner
Day Two
Background to microarray datamining: Jürg Bähler (This optional talk will be in the James Watson Room)
Practical Exercises

Fission Yeast Computing Workshop -4- Workshop Objectives
1. To provide a general overview of the S. pombe genome data
2. To provide a general overview of the model organism database GeneDB_Spombe, the Pfam protein family database, the UniProt protein database and the Gene Ontology (GO)
3. To demonstrate different ways of browsing the data:
   By location (genomic region)
   By annotation (protein family, GO annotation etc.)
4. To demonstrate effective ways of querying the data to:
   Identify gene sets of interest based on features, annotation, protein family etc.
   Perform complex integrated queries
   Download sequence/annotation of query results, or user-defined gene sets
5. To demonstrate effective searching to identify novel motifs and families or potential orthologs
6. To enable the location of functional data for protein families and/or potential orthologs in other species
7. To provide expertise for handling functional genomics datasets

Fission Yeast Computing Workshop -5- Data Collection and Integration
[Diagram: data flow and links into GeneDB. Labels: Sequence and annotation updates; Domain/family; Interpro2Go; GOA; SPKWG2O; Orthologs; GO curation; GO associations; Family identification; User submission; Sequence analysis; Literature curation; Data mining. Legend: Dataflow, Links]
Data Collection and Integration in GeneDB:
1. Data needs to be ‘collected’
2. Updates are submitted to public sequence databases
3. All coding sequences are curated to GO (function, process and component)
4. Historically, most information has been inferred from potential orthologs (ISS: inferred from sequence similarity)
5. All similarities are curated to the level of protein family and S. cerevisiae orthologs
6. Automatic (IEA: inferred from electronic annotation) mappings to GO are generated from Swiss-Prot (UniProt) keywords and protein families (Pfam and InterPro2GO mappings)

Fission Yeast Computing Workshop -6- The Project Website and Data Access Martin Aslett
Section Aim
To describe the S. pombe tools, data and resources available from the Sanger Institute
Content
Describe the data available via the S. pombe project pages
Describe data files available to download: whole genome sequence, features, annotation, data mappings
Describe available genome statistics
Describe the S. pombe mailing list
Describe the Gene Naming Committee pages

Fission Yeast Computing Workshop -7-

Fission Yeast Computing Workshop -8- The contigs or chromosomes in EMBL format are the files you can use to browse the data with the Artemis sequence viewer. Each ftp directory contains a README file describing the file content and format. Make sure you consult this before downloading the data.

Fission Yeast Computing Workshop -9- Estimated gap sizes indicate that X +/- Y kb remain to be sequenced

Fission Yeast Computing Workshop -10- Where possible, links are provided to the data described. These data are regularly updated.

Fission Yeast Computing Workshop -11- All previous postings (from December 2004 onward) may be accessed via the archives. Members can subscribe and post general S. pombe interest items or queries to the list.

Fission Yeast Computing Workshop -12-

Fission Yeast Computing Workshop -13- GeneDB Browsing, Searching, Downloading Martin Aslett Section Aim Demonstrate the features of GeneDB, the database which houses the genome data and annotation for S. pombe and the other genomes sequenced by the PSU and its collaborators. Contents The Gene Page and how the data is integrated with other resources How to search and browse the genome for features of interest How to construct queries to retrieve specific gene sets How to download sequence and features from directed searches in specific formats

Fission Yeast Computing Workshop -14- Full Content Search: Search all text of a page, including EC numbers, S. cerevisiae orthologs (using systematic ID), GO terms and PubMed IDs. Use double quotes.

Fission Yeast Computing Workshop -15- The Gene Page
Links: To BLAST server; To AmiGO
Curation includes: protein family, orthology, post-translational modification, transcriptional regulation, disease associations.
1) Similar terms are grouped.
2) Curation is browseable.

Fission Yeast Computing Workshop -16- Sequence download and feature viewing options GBrowse: Generic genome browser. Interactively manipulate and display features and annotation.

Fission Yeast Computing Workshop -17- Boolean Query Interface and History Search construction: Search Results: Query History: The query history allows you to union and intersect queries, and also to subtract them. Query options also include: exon number, TMM number, protein family, keyword, chromosome, status, etc.

Fission Yeast Computing Workshop -18- Browseable Catalogues Browseable Product List Pfam Catalogue Contig Maps Other browseable catalogues include Curation and SwissProt keyword

Fission Yeast Computing Workshop -19- Note: Any DNA sequence can be used to BLAST against the contigs and access the genomic region via the Artemis applet. Mapping genes and probes to genomic regions

Fission Yeast Computing Workshop -20- Download the sequence of your BLAST hits or view them in the context of the annotated genome Download and browse sequence regions Note: The use of Artemis is not covered in this workshop. However, it can be downloaded and installed locally from with manual found at the same URL. Artemis displays all curated features and annotation in the context of the DNA and its six-frame translation.

Fission Yeast Computing Workshop -21- Feedback and queries
Note: Separate feedback forms exist for biological and informatics-based queries.
Important: Error Reporting
If you see a problem with the annotation, however small, please fill in the feedback form, which is linked from every gene page.
Submission and Updating of Data
The feedback forms may also be used to inform us of your recent publications and to suggest GO terms. User submissions are usually processed within two weeks, after which your data will be immediately visible and accessible via many model organism databases, multi-organism sequence databases and the GO Consortium.
Remember, curation resources are limited and databases rely on the specialised knowledge of users to ensure they are accurate and up to date.

Fission Yeast Computing Workshop -22- Downloading sequence sets From the gene page :- User defined lists may also be downloaded from here. Select download format

Fission Yeast Computing Workshop -23- Position Specific Iterative BLAST (PSI-BLAST) PSI-BLAST is a tool for identifying weak but biologically relevant sequence similarities. Searching with PSI-BLAST produces a position-specific scoring matrix constructed from a multiple alignment of the top scoring BLAST responses to a given query sequence. This scoring matrix produces a profile designed to identify the key positions of conserved amino acids within a motif. When a profile is used to search a database (in subsequent PSI-BLAST iterations), it can often detect subtle relationships between proteins which are distant structural or functional homologues. These relationships are often not detected by a BLAST search with a sample query sequence.
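The idea of a position-specific scoring matrix can be illustrated with a minimal Python sketch. It is a simplification and not the PSI-BLAST implementation: it assumes a gap-free toy alignment, a uniform background frequency and a fixed pseudocount, whereas PSI-BLAST uses sequence weighting and more elaborate pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumptions (not PSI-BLAST's actual method): uniform background
    frequencies and a constant pseudocount for every residue.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical toy alignment of four hits over a four-residue motif.
toy_alignment = ["HCEH", "HCDH", "HCEH", "YCEH"]
pssm = build_pssm(toy_alignment)
print(score(pssm, "HCEH"), score(pssm, "AAAA"))
```

Conserved columns (here the C and the final H) contribute high scores, so a sequence matching the motif scores well above an unrelated one even if its overall similarity to any single query is weak.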

Fission Yeast Computing Workshop -24- Gene Coordinate Changes Shortcuts to Gene Lists

Fission Yeast Computing Workshop -25- S. pombe Gene Ontology annotations Valerie Wood Section Aim The following section provides an overview of GO annotation progress for S. pombe Contents How are GO annotations made How many GO associations are there i) for each evidence code ii) in total iii) compared to S. cerevisiae How are these distributed between the 3 ontologies How many gene products have no GO annotations

Fission Yeast Computing Workshop -26- S. pombe GO curation strategy, and number of associations (data from Feb 06)
Manual Curation
  Emphasis on primary literature (IDA, IMP, IGI, IPI, TAS)
  Manual inspection of sequence similarity (ISS)
Computational Mappings (IEA/RCA)
  InterPro to GO
  UniProt (Swiss-Prot keyword to GO)
  Pombe keyword to GO
  E.C. to GO
Summary
  IDA, IMP, IGI, IPI, TAS (IC, NAS): 1493 PMIDs, 7061 annotations, 2502 individual GO terms
  ISS: 8085 annotations
  IEA/RCA: 7514 annotations
Total by evidence code
  Code  Annotations
  IC     931
  IDA   1382
  IEP     67
  IGI    778
  IMP   1707
  IPI    710
  ISS   8085
  NAS    420
  ND       1
  RCA      9
  TAS   1065
  IEA   7514

Fission Yeast Computing Workshop -27- Data from Jan 06 GO annotation progress Total, manual and electronically inferred GO associations for S. pombe compared to the total for S. cerevisiae Excludes ncRNAs Excludes annotations to unknown terms IEA associations are only included in datasets when non-redundant with manual associations, therefore the number is decreasing S. cerevisiae data provided by SGD Total number of GO associations for S. pombe

Fission Yeast Computing Workshop -28- GO coverage (data from Feb 06)
[Figure: GO coverage for S. pombe and S. cerevisiae across cellular component, molecular function and biological process. Table labels: Gene count; dubious; Current total; No GO term; %; % GO term. None: 484]
Excludes annotations to unknown terms. S. cerevisiae data provided by SGD.
S. pombe has a lower absolute number, and a lower percentage, of genes for which there is no information about the molecular function, biological process or cellular component.

Fission Yeast Computing Workshop -29- GO coverage: S. pombe and S. cerevisiae (S. pombe data from Feb 06)
S. pombe has a smaller absolute number of gene products with no biological process or molecular function annotations. However, it has a larger absolute number with no cellular component associations. This is mainly a result of several high-throughput localization studies for S. cerevisiae.
Accessing AmiGO from the GeneDB S. pombe front page will give current annotation totals for the 3 ontology ‘aspects’: F, P and C.
NB: this will only work while explicit annotations are not made to the terms ‘molecular function unknown’, ‘biological process unknown’ and ‘cellular component unknown’.
The numbers will be slightly different in the GO Consortium AmiGO browser at www.geneontology.org filtered on species S. pombe, which does not currently display IEA annotations.

Fission Yeast Computing Workshop -30- A GO slim for process terms Unknown process S. pombe 1064 S. cerevisiae 1793 Data is from June 2005

Fission Yeast Computing Workshop -31- GO data integrated with protein family and ortholog data (data from Feb 06)
[Venn diagram: GO terms; S. cerevisiae orthologs; Pfam protein families. None: 374]
S. pombe
  Gene count               4991
  Dubious                    68
  Current gene total       4933
  GO/Pfam/cerevisiae       4559
  NO GO/Pfam/cerevisiae     374
Pfam coverage: S. pombe 75%, S. cerevisiae 69%
All families identified from sequence analysis and literature are submitted to Pfam.
Mainly conserved eukaryotic or fungal families of unknown function, unstudied in any organism to date. Some are absent from S. cerevisiae. Some of these are submitted to Pfam.
This data can be regenerated (or updated) using the GeneDB Boolean Query Tool. To access the set of genes with S. cerevisiae orthologs, use the option ‘Search curation keywords’ and the search string ‘similar to S. cerevisiae’.

Fission Yeast Computing Workshop -32- Detection of sequence similarity Valerie Wood
Section Aim
To provide an overview of identified orthologs between S. pombe and S. cerevisiae.
To provide some background for the identification of distant orthologs, families and motifs for orphans (sequences and characterised genes with no identified database similarities).

Fission Yeast Computing Workshop -33- S. pombe gene A S. pombe gene A’ S. cerevisiae gene A duplication paralogs speciation orthologs Assessing Sequence Similarity

Fission Yeast Computing Workshop -34- S. pombe : S. cerevisiae One to one 2396 One to many Many to one Many to many Mainly informational Mainly communication

Fission Yeast Computing Workshop -35- For statistically insignificant short motifs, confidence in ortholog assignment can be increased by consideration of:

Fission Yeast Computing Workshop -36- Exercises by objectives
1. To provide a general overview of the S. pombe genome data
   All exercises
2. To provide a general overview of the model organism database GeneDB_Spombe, the Pfam protein family database, the UniProt protein database and the Gene Ontology (GO)
   Exercise 1 (GO annotation (AmiGO)/YOGY)
   Exercise 2 (GO, GeneDB download)
   Exercise 3 (Complex GO queries)
   Exercise 5 (Integrated Query)
   Exercise 6 (Pfam)
3. To demonstrate different ways of browsing the data: by location (genomic region); by annotation (protein family, GO annotation etc.)
   Exercise 2 (GO, GeneDB download)
   Exercise 6 (Pfam)
4. To demonstrate effective ways of querying the data to: identify gene sets of interest based on features, annotation, protein family etc.; perform complex integrated queries; download sequence/annotation of query results, or user-defined gene sets
   Exercise 2 (GO, GeneDB, download)
   Exercise 3 (Complex GO queries, download)
   Exercise 5 (Integrated Query)
   Exercise 6 (Pfam)
5. To demonstrate effective searching to identify novel motifs and families, and potential orthologs
   Exercise 6 (Pfam)
   Exercise 7 (Identifying distant orthologs)
6. To enable the location of functional data for protein families and/or potential orthologs in other species
   Exercise 1 (GO annotation (AmiGO)/YOGY)
   Exercise 2 (GO, GeneDB download)
   Exercise 6 (Pfam)
   Exercise 7 (Identifying distant orthologs/YOGY)
7. To provide expertise for handling functional genomics datasets
   Exercise 4 (Onto-Express, identifying statistically over-represented GO terms in gene lists)

Fission Yeast Computing Workshop -37- Exercise 1 A GO annotation exercise
Find a gene(s) of interest in GeneDB.
Identify the GO terms this gene is annotated to in QuickGO (the EBI GO browser).
Are the parents of the annotated terms, back to the root node, biologically correct, or can you see ‘true path violations’ (see page 33/34)?
Are the definitions biologically correct? Report any inconsistencies to the GO office.
Can you identify any additional GO terms for this gene(s) based on sequence similarity, published information or incomplete annotation?
Use the ‘curator feedback form’ on the Gene page to submit GO ID, evidence code and PMID (if experimentally supported).
Can you replace any NAS or IC evidence codes with experimental codes?
Tips:
i) QuickGO often reports commonly co-annotated terms at the bottom of the graph
ii) Consider binding terms (metal binding, ATP binding). You can also capture protein-protein interactions with GO using the function term ‘protein binding’ and the IPI evidence code
iii) Use the YOGY link from the gene page to see if any GO terms applied to orthologs are applicable
iv) Can the existing annotation be made more granular (specific)?
v) See page 36 for a description of evidence codes
vi) See page 43/44 for instructions to update GO terms

Fission Yeast Computing Workshop -38- Exercise 2 Identifying and downloading gene sets
Identify the term for the Dam1 complex in AmiGO.
Identify the list of S. pombe gene products annotated to this term. Q: How many are there?
Set the species filter and data source to all. Q: Which other organisms have the DASH complex?
Follow the link to GeneDB from one of the gene products.
Use the ‘others’ column in the GO data section of the gene page to access the complete list of S. pombe subunits.
Send your dataset to the list download (see page 26).
Save a FASTA file of 100 bp of DNA 3’ to the ORF.
Tips
Remember when you search GO that:
1. A search on a GO term returns annotations to ALL children of that term
2. A gene product annotated to a term is automatically annotated to ALL of its parents
You can access lists of gene products annotated to common terms using the AmiGO GO browser. You can also access gene products annotated to terms of interest in other species using AmiGO. However, the GeneDB Boolean Query tool allows you to perform more complex queries (for example addition, subtraction, union or intersection) on datasets of common GO terms, or even on GO terms combined with other features.
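The two search rules in the tips above (a term search returns annotations to all children; an annotation implies all parents) are two views of the same propagation step. The Python sketch below shows it on a hypothetical three-level toy ontology. The term names, gene names and the `PARENTS` structure are invented for illustration and are not taken from GO or from AmiGO's implementation.

```python
# Hypothetical toy ontology: each term maps to the set of its direct parents.
PARENTS = {
    "child_x": {"mid"},
    "child_y": {"mid"},
    "mid": {"root"},
}

# Hypothetical direct annotations: gene product -> terms it is annotated to.
DIRECT = {
    "geneA": {"child_x"},
    "geneB": {"child_y"},
    "geneC": {"mid"},
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Map each term to the gene products annotated to it or to any child."""
    by_term = {}
    for gene, terms in direct_annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                by_term.setdefault(t, set()).add(gene)
    return by_term


by_term = propagate(DIRECT, PARENTS)
print(sorted(by_term["mid"]))      # genes annotated to 'mid' or its children
print(sorted(by_term["child_x"]))  # only the direct annotation
```

Searching on the hypothetical term `mid` therefore returns the gene products annotated directly to it together with those annotated to either child.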

Fission Yeast Computing Workshop -39- Exercise 3 Performing complex GO queries (The Boolean Query Interface)
Use the Boolean Query tool to access all genes annotated to the GO process term ‘chromatin remodelling’ AND the GO component term ‘chromatin’ (intersect).
You need to select the AND operator first, then select ‘genes with a specific GO process’ and ‘genes with a specific GO component’ as your query options.
Select ‘Proceed to next step’.
Scroll to the bottom of your results and visit the history page.
Do separate Boolean queries for the function term ‘transcription factor activity’ and the component term ‘nucleus’ (do not use any operators).
Visit the ‘history page’ link at the bottom of the results page.
Q: How many of your gene products are annotated to: i) chromatin remodelling AND ii) chromatin, OR (union) iii) transcription factor activity, BUT NOT iv) nucleus?
Q: Why are these gene products not annotated to nucleus?
Download the results set in tab-delimited format.
Q: Should any of these be annotated to nucleus?
Tips
You can perform this second query using the query history. First union your query for i) and ii) with iii), then subtract iv) ‘nucleus’ from this.
When using the ‘subtraction’ option in the history, you need to perform your queries in the correct order.
Use YOGY to see if the orthologs of any of these gene products are known to be nuclear.
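The query in this exercise is ordinary set algebra, which can be sketched with Python sets. The gene identifiers below are placeholders, not real S. pombe genes, and the sketch says nothing about how GeneDB implements its query history. It only shows why the order of a subtraction matters.

```python
# Placeholder result sets, one per individual query (hypothetical gene IDs).
chromatin_remodelling = {"g1", "g2", "g3"}   # i)  GO process
chromatin = {"g2", "g3", "g4"}               # ii) GO component
transcription_factor = {"g5", "g6"}          # iii) GO function
nucleus = {"g2", "g5", "g7"}                 # iv) GO component

# (i AND ii): intersection.
step1 = chromatin_remodelling & chromatin

# ... OR iii: union with the transcription factor set.
step2 = step1 | transcription_factor

# ... BUT NOT iv: subtract the nucleus set from the union.
result = step2 - nucleus

# Subtraction is not symmetric: reversing the operands answers a
# different question (nuclear genes that are not in the union).
reversed_order = nucleus - step2

print(sorted(result))          # ['g3', 'g6']
print(sorted(reversed_order))  # ['g7']
```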

Fission Yeast Computing Workshop -40- Exercise 4: Looking for over-represented GO terms in a gene set using Onto-Express
GO annotations can be used to obtain functional data about gene sets, e.g. from gene expression experiments. There are several tools available to perform this sort of analysis, all developed by groups outside of the GO Consortium. They work in a similar way: a full gene set is uploaded, with a subset of all ‘interesting’ genes, usually those that have been up- or down-regulated in an expression experiment. The tool then determines which GO categories have been enriched for ‘interesting’ genes, and provides some sort of statistical measure to guard against GO categories that appear by chance alone. For a full list of these tools, see:
This tutorial will be using one such tool, Onto-Express, one of a package of microarray tools, Onto-Tools, developed at the Intelligent Systems and Bioinformatics Laboratory, Wayne State University.
First you need to obtain an Onto-Express login. Go to: and fill in the details. Your password will be emailed straight to you.
Once your password has arrived, go to the login page: fill in your login details and click ‘Submit’. You will see a security pop-up; choose ‘Grant Always’. A second pop-up will appear: choose Onto-Express and click Run.

Fission Yeast Computing Workshop -41- Note: You will need to leave the original browser window open for the whole of your session.
We have provided some test microarray data for this tutorial; to download it, go to: and download both files (sp_changed.txt and sp_total.txt). The input file format is a simple text file with either accession numbers, cluster identifiers or probe identifiers (but not a combination), each listed on a separate line.
Tip: to download the files, right-click and choose ‘Save Link As’. Remember where you saved them!
Return to your input window, which should look like this:
Click the ‘Input File’ button and browse for the file sp_changed.txt that you’ve just downloaded. This file contains a list of genes that were under- or over-expressed in the experiment.
Now choose ‘schizosaccharomyces pombe’ from the Organism menu, and for Input Type choose ‘sanger genedb’; this chooses the format of the gene list.
From the Reference Array menu choose ‘My own array’ and then click the Reference file button to browse for the file sp_total.txt. This file contains a list of all of the genes on your chip. If you used a commercial chip in your experiment, you can choose this from the Reference Array list rather than uploading your own reference file.
Leave all the other settings as they are and click ‘Submit’.

Fission Yeast Computing Workshop -42- After a minute or so, a results page will appear:
Click the ‘Tree View’ tab. The number in bold following the term is the number of genes from the ‘interesting’ subset that are annotated to this term and its child terms, in the same way as AmiGO.
Open the nodes ‘biological process’ -> ‘development’ -> ‘morphogenesis’ -> ‘cellular morphogenesis’ -> ‘establishment and/or maintenance of cell polarity’. You will see a list of genes annotated directly to ‘establishment and/or maintenance of cell polarity’. Clicking on the gene names gives you the option of viewing the gene details.
Now click the ‘Synchronized View’ tab. This shows you a flat view of all open nodes in the tree view.
Q: Which is the only GO category that has a significantly different number of genes associated than you would expect? Why do you think this is?
Q: What processes, functions and cellular components seem to be associated with these microarray data?
From the Display options in the top right, choose Display ‘Biological Process’ and Sort by ‘Total’. Total is the number of genes associated with the GO term.
Q: Which GO biological process term has the most genes associated with it? Is the over-representation of this term statistically significant?
Q: Which molecular function and cellular component GO categories have the most genes associated with them?
Hint: To switch between which ontology is shown in the flat view, use the Display options.
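The kind of statistical measure such tools report can be illustrated with a hypergeometric upper-tail probability. This Python sketch is an assumption for illustration only: the slides do not say which model Onto-Express applies, the function name is hypothetical, and the counts are invented rather than taken from sp_changed.txt or sp_total.txt.

```python
from math import comb


def hypergeom_upper_tail(total_genes, annotated_total, subset_size, annotated_in_subset):
    """P(X >= annotated_in_subset) for a subset drawn without replacement.

    total_genes:          genes on the reference array
    annotated_total:      reference genes annotated to the GO term
    subset_size:          number of 'interesting' genes
    annotated_in_subset:  'interesting' genes annotated to the GO term
    """
    max_k = min(annotated_total, subset_size)
    denominator = comb(total_genes, subset_size)
    numerator = sum(
        comb(annotated_total, k) * comb(total_genes - annotated_total, subset_size - k)
        for k in range(annotated_in_subset, max_k + 1)
    )
    return numerator / denominator


# Invented example: 5000 genes on the array, 100 annotated to a term,
# 200 'interesting' genes of which 12 carry the term (about 4 expected).
p_value = hypergeom_upper_tail(5000, 100, 200, 12)
print(f"p = {p_value:.2e}")
```

A small probability suggests the term is enriched in the subset beyond what chance alone would give. When many GO terms are tested at once, a multiple-testing correction is normally applied as well.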

Fission Yeast Computing Workshop -43- Exercise 5 Integrated query
[Venn diagram created from February’s data: GO terms; S. cerevisiae orthologs; Pfam protein families. Set 1: None 374? Set 2: 40?]
Has the number of gene products with NO S. cerevisiae ortholog, NO Pfam family and NO assigned GO term changed (i.e. Set 1: 374)?
Use the Boolean query tool and query history to UNION A, B and C and SUBTRACT from the total number of protein coding genes.
How many of these are ‘sequence orphans’ or ‘S. pombe specific families’?
Has the number of Pfam protein families not assigned a GO term and without a S. cerevisiae ortholog changed?
How many of Set 2 are conserved in fungi/eukaryotes (view list)?
For the eukaryotically conserved set, does YOGY provide any functional clues?
Tips
To access the set of genes with S. cerevisiae orthologs, use the option ‘Search curation keywords’ and the search string ‘similar to S. cerevisiae’.
Don’t forget to take ‘dubious’ genes into account (under ‘Annotation Status’).
For subtraction queries you need to think about the order of your search.
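The arithmetic behind Set 1 is "total minus union", which the following Python sketch illustrates with placeholder gene IDs. The function name and all identifiers are hypothetical, and how the real query handles ‘dubious’ genes is an assumption based on the tip above.

```python
def genes_with_no_information(all_genes, dubious, has_ortholog, has_pfam, has_go):
    """Return genes lacking an ortholog, a Pfam family and a GO term.

    Dubious genes are removed first, then the union of the three
    'has information' sets is subtracted from what remains.
    """
    current = all_genes - dubious
    informative = has_ortholog | has_pfam | has_go
    return current - informative


# Placeholder data: ten genes, one of them dubious.
all_genes = {f"g{i}" for i in range(1, 11)}
dubious = {"g10"}
has_ortholog = {"g1", "g2", "g3"}   # A
has_pfam = {"g2", "g3", "g4", "g5"}  # B
has_go = {"g1", "g5", "g6"}          # C

set1 = genes_with_no_information(all_genes, dubious, has_ortholog, has_pfam, has_go)
print(sorted(set1))  # ['g7', 'g8', 'g9']
```

The two resulting groups always partition the current gene set, which is the relationship behind the February figures quoted in the workshop (4559 + 374 = 4933).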

Fission Yeast Computing Workshop -44- Exercise 6 Using Pfam
Pfam online tutorial
Contents of the tutorial
Section 1 - The contents of a Pfam family page
Section 2 - Searching Pfam
Section 3 - Exploring Pfam Clans
Section 4 - Taxonomy Queries
Section 5 - Domain Combination Queries
Section 6 - Genome Comparison
Section 7 - Creating and Editing Pfam domain graphics

Fission Yeast Computing Workshop -45- Exercise 7 Identifying distant orthologs and sequence similarities
Can you detect orthologs for any of these orphan proteins: SPAC17H9.06, SPCC553.01c, SPAPB8E5.10, SPBC428.17c, SPBC947.14c?
Identify the missing S. pombe subunit of the conserved CCAAT-binding factor complex, orthologous to S. cerevisiae HAP4/YKL109W.
Can you identify any functional information for your predicted orthologs?
Tips
Follow the protocol on the following pages.
All of these genes have orthologous relationships or have been identified as members of protein families which have not yet been annotated. Most will not be identified by a simple BLAST search.

Fission Yeast Computing Workshop -46- Distant ortholog detection protocol
[Flowchart; the yes/no branch labels are shown where they appear on the slide]
1. Do any of the main orthology identification tools identify a potential ortholog which has been missed by manual inspection? (Use the YOGY link from the gene page.)
2. Your protein may not have a Pfam-A domain, but check the Pfam link to see if there is a Pfam-B (see over the page for how to access it).
   Tip: The Pfam domain organization view can also provide other clues: the location of low-complexity regions, coiled-coil regions, transmembrane regions and predicted signal sequences.
3. Perform a BLAST search against the S. cerevisiae protein set using the GeneDB Omniblast server. (Use the link from the gene page.) Note any candidates.
   Tip: The remaining unidentified orthologs will not usually be ‘reciprocal best hits’ (i.e. top hits in the respective species). You can probably dismiss hits where the orthologs are already clearly identified and well conserved. You are looking for ‘corresponding regions of similarity’, i.e. similar protein length, co-linear HSPs, conserved N- or C-terminal regions.
4. Perform a PSI-BLAST search against the UniProt protein set using the GeneDB PSI-BLAST server. (Use the link from the gene page.) Reiterate to convergence. Do you have a S. cerevisiae candidate?
5. Note the accessions of your other hits. They are likely to be from other fungi which are phylogenetically between the two yeasts. Search with the closest to S. cerevisiae. Do you have a S. cerevisiae candidate?
6. Note the UniProt accession numbers of your candidates and multiply align them (see the following section to make a FASTA file and run ClustalW). Are key residues and regions conserved? Is the species distribution expected? (yes / no)
7. Do you obtain any additional hits? Are the 2 sets overlapping? (yes / no)
If you have no S. cerevisiae hit, it is possible your protein is fungally conserved but absent from S. cerevisiae.
PTO…
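The term ‘reciprocal best hits’ used in the tip can be made concrete with a short Python sketch. The protein identifiers and scores are invented, and the input layout (a dictionary keyed by query and subject) is an assumption for illustration rather than the output format of any BLAST server.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: {(query, subject): score}; ties keep the first hit seen.
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where each is the other's top hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Invented scores: sp* are S. pombe queries, sc* are S. cerevisiae proteins.
pombe_vs_cerevisiae = {
    ("sp1", "sc1"): 250, ("sp1", "sc2"): 90,
    ("sp2", "sc1"): 60, ("sp2", "sc3"): 40,
}
cerevisiae_vs_pombe = {
    ("sc1", "sp1"): 250, ("sc1", "sp2"): 60,
    ("sc2", "sp1"): 90, ("sc3", "sp2"): 40,
}

rbh = reciprocal_best_hits(pombe_vs_cerevisiae, cerevisiae_vs_pombe)
print(sorted(rbh))  # [('sp1', 'sc1')]
```

In this toy data sp2 has no reciprocal best hit, because its top hit sc1 is already paired with sp1. That is the situation the protocol addresses: the weaker sp2/sc3 pair is only a candidate, to be assessed by length, co-linearity and conserved regions.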

Fission Yeast Computing Workshop -47- Section in progress
Clues from Pfam when no Pfam-A domains are identified
Still nothing?
Try again at a later date. Remember, Pfam rebuilds Pfam-Bs every release, and new genomes are continually sequenced and added to UniProt and YOGY.
You could also search Pfam with a reduced threshold (increased probability) to see if any less conserved domains can be detected.
Has your gene prediction been validated? Perhaps it contains errors which are affecting similarity detection…
Tip: Consider searching with any S. cerevisiae PSI-BLAST hits below ‘threshold’. Hits to Ashbya gossypii are particularly useful, as syntenic S. cerevisiae orthologs are identified for most genes. Confirm at the Ashbya Genome Database (AGD).

Fission Yeast Computing Workshop -48- To retrieve FASTA format sequences
Tip: Select sequence library UniProt and ‘standard query form’, then….
Use your FASTA format file as input into ClustalW. Switch on Clustal colours to see conserved amino acids more clearly.
Pipe separated UniProt IDs

Fission Yeast Computing Workshop -49- Fungal phylogeny This fungal phylogeny will help you decide which candidate hits to search with. Hits closer to S. cerevisiae are most likely to give good PSI-BLAST alignments for position-specific scoring because of the larger number of closely related genomes available.