Genome Annotation. Daniel Lawson, EBI. Bioinformatics tools for Comparative Genomics of Vectors, August 2008.

Presentation transcript:

Slide 1: Genome Annotation
Daniel Lawson, EBI

Slide 2: Repeats

Slide 3: Genome annotation - building a pipeline
[Figure: pipeline flowchart. Boxes, in order: Genome sequence; Map repeats; Genefinding; Protein-coding genes; Map ESTs; Map peptides; nc-RNAs; Functional annotation; Release]

Slide 4: Repeat features
- Genomes contain repetitive sequences

Genome              Size (Mb)   % Repeat
Aedes aegypti       1,300       ~70
Anopheles gambiae   260         ~30
Culex pipiens       540         ~50

Slide 5: Repeat features: Tandem repeats
- Pattern of two or more nucleotides repeated where the repetitions are directly adjacent to each other
- Polymorphic between individuals/populations
- Example programs: Tandem, TRF
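The definition above (a short unit repeated in directly adjacent copies) can be illustrated with a small regular-expression search. This is only a sketch for exact, perfect repeats; it is not the algorithm used by Tandem or TRF, and the function name `find_tandem_repeats` and its parameters are hypothetical.

```python
import re

def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, end, unit, copies) for exact tandem repeats.

    Coordinates are 0-based, end-exclusive. Only perfect copies are found;
    real tools also tolerate mismatches and indels.
    """
    seq = seq.upper()
    # A unit of min_unit..max_unit bases followed by at least
    # (min_copies - 1) further identical copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits

print(find_tandem_repeats("GGCACACACATT"))
```

The lazy quantifier makes the search prefer the shortest repeating unit, so a run such as CACACACA is reported as four copies of CA rather than two copies of CACA.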

Slide 6: Repeat features: Interspersed elements
- Transposable elements (TEs)
- Transposons, retrotransposons etc.
- Entire research field in itself
- Example programs: RepeatScout, RECON

Slide 7: Finding repeats as a preliminary to gene prediction
- Repeat discovery
  - Literature and public databanks
  - Automated approaches (e.g. RepeatScout or RECON)
- Generate a library of example repeat sequences (FASTA file with a defined header line format)
- Use RepeatMasker to search the genome and mask the sequence

Slide 8: Masked sequence
- Repeatmasked sequence is an artificial construction where those regions which are thought to be repetitive are marked with X's
- Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs in the final annotation set

>my sequence
atgagcttcgatagcgatcagctagcgatcaggctactattggct
tctctagactcgtctatctctattagctatcatctcgatagcgatcag
ctagcgatcaggctactattggcttcgatagcgatcagctagcga
tcaggctactattggcttcgatagcgatcagctagcgatcaggct
actattggctgatcttaggtcttctgatcttct

>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcg
atcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxx
tagcgatcaggctactattggcttcgatagcgatcagctagcgat
caggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
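As a rough illustration of what hard masking does to a sequence, the sketch below overwrites given repeat intervals with a mask character. It assumes the repeat coordinates are already known (for example from a repeat-finding step); the function name `hard_mask` and the 0-based, end-exclusive interval convention are assumptions made for this example, not part of any tool.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace every base inside the given intervals with mask_char.

    intervals: iterable of (start, end) pairs, 0-based and end-exclusive.
    Coordinates outside the sequence are clipped.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

# Mask bases 3..5 of a short example sequence.
print(hard_mask("atgagcttcg", [(3, 6)]))
```

Because the masked bases are overwritten, the original sequence cannot be recovered from a hard-masked file, which is why downstream tools simply skip those regions.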

Slide 9: Masked sequence - Hard or Soft?
- Sometimes we want to mark up repetitive sequence but not to exclude it from downstream analyses. This is achieved using a format known as soft-masked, in which repetitive regions are written in lower case and the rest of the sequence in upper case

>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATC
AGGCTACTATTGGCTTCTCTAGACTCGTCTA
TCTCTATTAGTATCATCTCGATAGCGATCAGC
TAGCGATCAGGCTACTATTGGCTTCGATAGC
GATCAGCTAGCGATCAGGCTACTATTGGCTT
CGATAGCGATCAGCTAGCGATCAGGCTACTA
TTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATC
AGGCTACTATTggcttctctagactcgtctatctctattagtat
cATCTCGATAGCGATCAGCTAGCGATCAGGC
TACTATTggcttcgatagcgatcagcTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATCAG
GCTACTATTGGCTGATCTTAGGTCTTCTGAT
CTTCT
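The practical difference between the two formats is that soft masking keeps the bases, so the repeat coordinates can be recovered or the sequence converted to hard-masked form later. A minimal sketch, assuming the lower-case convention shown in the example above; the helper names `soft_mask`, `masked_intervals` and `soft_to_hard` are hypothetical.

```python
import re

def soft_mask(seq, intervals):
    """Upper-case the sequence, then lower-case the given repeat intervals.

    intervals: (start, end) pairs, 0-based and end-exclusive.
    """
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

def masked_intervals(soft_seq):
    """Recover the repeat intervals from the lower-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", soft_seq)]

def soft_to_hard(soft_seq, mask_char="X"):
    """Convert soft-masked sequence to hard-masked (information is lost)."""
    return re.sub(r"[a-z]", mask_char, soft_seq)

soft = soft_mask("ATGAGCTTCG", [(3, 6)])
print(soft, masked_intervals(soft), soft_to_hard(soft))
```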

Slide 10: Pairwise alignments

Slide 11: Genome annotation - building a pipeline
[Figure: pipeline flowchart. Boxes, in order: Genome sequence; Map repeats; Genefinding; Protein-coding genes; Map ESTs; Map peptides; nc-RNAs; Functional annotation; Release]

Slide 12: Genefinding

Slide 13: Genome annotation - building a pipeline
[Figure: pipeline flowchart. Boxes, in order: Genome sequence; Map repeats; Genefinding; Protein-coding genes; Map ESTs; Map peptides; nc-RNAs; Functional annotation; Release]

Slide 14: More terminology
- Gene prediction: predicted exon structure for the primary transcript of a gene
- CDS: coding sequence for a protein-coding gene prediction (not necessarily continuous in a genomic context)
- ORF: open reading frame, a sequence devoid of stop codons
- Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage
- ab initio prediction: prediction of gene structure from first principles using only the genome sequence
- Hidden Markov Model (HMM): statistical model (dynamic Bayesian network) which can be used as a sensitive, statistically robust search algorithm. Use of profile HMMs to search sequence data is widespread
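The ORF definition above (a stretch of sequence devoid of stop codons) can be made concrete with a short scan of the three forward reading frames. This is a sketch only: it uses the standard stop codons TAA, TAG and TGA, does not require a start codon, and ignores the reverse strand; `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches on the forward strand.

    Coordinates are 0-based, end-exclusive, and exclude the stop codon itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
        # Close the final run at the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, end))
    return orfs

print(find_orfs("ATGAAACCCTAAGGG"))
```

In the example, frame 0 yields the stretch ATG AAA CCC before the TAA stop; the other two frames also contain short stop-free stretches, which is why real gene finders combine ORFs with further signals rather than relying on them alone.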

Slide 15: Eukaryote genome annotation
[Figure: from genome to function. Genome (ATG ... STOP) -> Transcription -> Primary transcript -> RNA processing -> Processed mRNA (m7G cap, AAAn tail) -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity. Annotation tasks shown alongside: Find locus; Find exons using transcripts; Find exons using peptides; Find function]

Slide 16: Prokaryote genome annotation
[Figure: from genome to function. Genome (START ... STOP) -> Transcription -> Primary transcript -> RNA processing -> Processed RNA -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity. Annotation tasks shown alongside: Find locus; Find CDS; Find function]

Slide 17: Genefinding
[Figure: two approaches, ab initio and similarity]

Slide 18: Genefinding resources
- Transcript
  - cDNA sequences
  - EST sequences
  - Other (MPSS, SAGE, ditags)
- Peptide
  - Non-redundant (nr) protein database
  - Protein sequence data, mass spectrometry data
- Genome
  - Other genomic sequence

Slides 19-20: ab initio prediction
[Figure: Genome (ATG ... STOP) -> Transcription -> Primary transcript -> RNA processing -> Processed mRNA (m7G cap, AAAn tail) -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity]

Slide 21: Genefinding - ab initio predictions
- Use compositional features of the DNA sequence to define coding segments (essentially exons)
  - ORFs
  - Coding bias
  - Splice site consensus sequences
  - Start and stop codons
- Each feature is assigned a log likelihood score
- Use dynamic programming to find the highest scoring path
- Need to be trained using a known set of coding sequences
- Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
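The "score each feature, then find the highest scoring path by dynamic programming" idea can be sketched with a toy two-state model (non-coding N, coding C). This is a deliberately simplified Viterbi-style recursion, not the model used by any of the programs named above; the per-position scores, the `switch_penalty` parameter and the function name `best_path` are all invented for illustration.

```python
def best_path(scores, switch_penalty=2.0):
    """Find the highest-scoring labelling of positions as N or C.

    scores: list of (noncoding_score, coding_score) log-likelihoods,
            one pair per position (hypothetical values in this sketch).
    switch_penalty: cost subtracted whenever the state changes.
    Returns (path_string, total_score).
    """
    states = (0, 1)                    # 0 = non-coding, 1 = coding
    best = [list(scores[0])]           # best[i][s]: best score ending in s at i
    back = []                          # back-pointers for the traceback
    for emit in scores[1:]:
        row, ptr = [], []
        for s in states:
            cands = [best[-1][p] - (switch_penalty if p != s else 0.0)
                     for p in states]
            p_best = max(states, key=lambda p: cands[p])
            row.append(cands[p_best] + emit[s])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return "".join("NC"[s] for s in path), max(best[-1])

toy_scores = [(0, -1), (0, -1), (-1, 1), (-1, 1), (-1, 1), (0, -1), (0, -1)]
print(best_path(toy_scores, switch_penalty=1.5))
```

With these toy scores the three positions that favour coding are labelled C despite the cost of switching state twice. Real gene finders use many more states (exon phases, introns, splice signals) and scores trained on known coding sequences.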

Slides 22-24: ab initio prediction
[Figure: tracks along the genome showing coding potential, ATG and stop codons, and splice sites; the final slide adds the step "Find best prediction"]

Slides 25-26: Similarity prediction
[Figure: Genome (ATG ... STOP) -> Transcription -> Primary transcript -> RNA processing -> Processed mRNA (m7G cap, AAAn tail) -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity. Annotation tasks added on slide 26: Find exons using transcripts; Find exons using peptides]

Slide 27: Genefinding - similarity
- Use known coding sequence to define coding regions
  - EST sequences
  - Peptide sequences
- Needs to handle fuzzy alignment regions around splice sites
- Needs to attempt to find start and stop codons
- Examples: EST2Genome, exonerate, genewise

Slide 28: Similarity-based prediction
[Figure: a cDNA/peptide is aligned to the genome ("Align"), then a gene prediction is built from the alignment ("Create prediction")]

Slide 29: Genefinding - comparative
- Use 2 or more genomic sequences to predict genes based on conservation of exon sequences
- Examples: Twinscan and SLAM

Slide 30: Genefinding - manual
- Manual annotation is time consuming
- Annotators use specialized utilities to view genomic regions with tiers/columns of data from which they construct a gene prediction
- Most decisions are subjective and tedious to document
- Avoids the systematic problems of ab initio predictors and automated annotation pipelines

Slides 31-33: Manual prediction
[Figure: evidence tracks showing coding potential, ATG and stop codons, splice sites, and EST similarity; the final slide adds the step "Predict structure"]

Slide 34: Genefinding - non-coding RNA genes
- Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
- tRNAscan - uses an HMM and covariance model for prediction of tRNA genes
- Rfam - a suite of HMMs trained against a large number of different RNA genes

Slide 35: Example system

Slide 36: Overview of current annotation system
[Figure: flowchart. Boxes: Assembled genome; VectorBase gene predictions; Sequencing centre gene predictions; Merge into canonical set; Protein analysis; Display on genome browser; Release to GenBank/EMBL/DDBJ]

Slide 37: VectorBase gene prediction pipeline
[Figure: pipeline diagram. Boxes: Blessed predictions; Community submissions; Manual annotations; Species-specific predictions; Similarity predictions; Transcript based predictions; Ab initio gene predictions; Canonical predictions; Protein family HMMs (Genewise); ncRNA predictions (Rfam). Tool labels attached to the other boxes: (Genewise), (SNAP), (Exonerate), (Apollo), (Genewise, Exonerate, Apollo)]

Slide 38: VectorBase curation database pipeline for manual/community annotation
[Figure: data-flow diagram. Labels: Curation warehouse db; Manual annotation (Harvard); Apollo; Community annotation (Community representatives); Chado-XML; Chado; Ensembl; GFF3; Gene build db; Community annotation]

Slide 39: Genefinding - Review
- Gene prediction relies heavily on similarity data
- EST/cDNA sequences are vital for genefinding
  - Training for ab initio approaches
  - Similarity builds
  - Validating predictions
- Protein data is the predominant supporting evidence for prediction in most vector genomes
  - Need to be wary of predicting from predictions
- Genefinding is still something of a dark art
- Efforts to standardize and document supporting evidence for prediction and modifications are ongoing

Slide 40: Genefinding omissions
- Alternative splice forms
  - Currently there is no good method for predicting alternative isoforms
  - Only created where supporting transcript evidence is present
- Pseudogenes
  - Each genome project has a fuzzy definition of pseudogenes
  - Badly curated/described across the board
- Promoters
  - Rarely a priority for a genome project
  - Some algorithms exist but usually not integrated into an annotation set

Slide 41: Functional annotation

Slide 42: Functional annotation
- Utilise known structure/function information to infer facts related to the predicted protein sequence
- Provide users with results from a number of standard algorithms/searches
- Provide users with cross-references (dbxrefs) to other resources
- Assign a simple one line description for each gene product
  - This will never be comprehensive
  - This will always be somewhat general

Slide 43: Genome annotation
[Figure: Genome (ATG ... STOP) -> Transcription -> Primary transcript -> RNA processing -> Processed mRNA (m7G cap, AAAn tail) -> Translation -> Polypeptide -> Protein folding -> Folded protein -> Enzyme activity -> Functional activity. Annotation task highlighted: Find function]

Slide 44: Functional annotation - protein similarities
- Predicted proteins are searched against the non-redundant protein database to look for similarities
- Visually assess the top 5-10 hits to identify whether these have been assigned a function
- It is important to check how the function of the top hits has been assigned in order not to transfer erroneous annotations

Slide 45: Functional annotation - Protein domains
- Protein domains have a number of definitions based on their size, folding and function/evolution
- Domains are a part of protein structure description
- Domains with a similar structure are likely to be related evolutionarily and have a similar function
- We can use this to infer function (and structure) for an unknown protein by comparison to known proteins
- The tool of choice here is a Hidden Markov Model (HMM)

Slide 46: Protein Domain databases
- InterPro
- UniProt - protein database
- Prosite - database of regular expressions
- Pfam - profile HMMs
- PRINTS - conserved protein signatures
- Prodom - collection of multiple sequence alignments
- SMART - HMMs
- TIGRfams - HMMs
- PIRSF
- Superfamily
- Gene3D
- Panther - HMMs
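To show what "database of regular expressions" means in practice, the sketch below translates a small subset of PROSITE pattern syntax into a Python regular expression and scans a peptide with it. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the converter `prosite_to_regex` is a hypothetical helper that handles only the common elements (x, [..], {..}, repeat counts, < and > anchors) and the test peptide is made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string.

    Supported subset: residues, x (any residue), [ABC] (allowed),
    {ABC} (forbidden), (n) / (n,m) repeats, < and > terminus anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Order matters: braces become negated classes before
        # parentheses are turned into regex repeat braces.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
peptide = "MKNGSAANPTQ"  # invented example sequence
print(regex, [m.start() for m in re.finditer(regex, peptide)])
```

Short patterns like this match by chance quite often, which is one reason profile-based resources such as Pfam are generally preferred for domain assignment.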

Slide 47: Functional annotation - Other features
- Other features which can be determined
  - Signal peptides
  - Transmembrane domains
  - Low complexity regions
  - Various binding sites, glycosylation sites etc.
- See http://expasy.org/tools/ for a good list of possible prediction algorithms

Slide 48: Signal peptides
- Short peptide sequence found at the N-terminus of a pre-protein which marks the peptide for transport across one or more membranes
- e.g. SignalP

Slide 49: Transmembrane domains
- Simple hydrophobic regions which sit inside a membrane
- Transmembrane domains anchor proteins in a membrane and can orient other domains in the protein correctly
- Examples: Receptors, transporters, ion channels
- Identified based on the protein composition using a simple sliding window algorithm or an HMM
- e.g. Tmpred, TMHMM
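The "simple sliding window algorithm" can be sketched as a hydropathy profile: average a per-residue hydrophobicity value over a fixed window and flag windows above a threshold. The sketch assumes the Kyte-Doolittle hydropathy scale with a window of 19 residues and a threshold of 1.6, which are conventional choices rather than anything stated on the slide; it is not the method implemented by Tmpred or TMHMM, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of each window; raises KeyError on unknown residues."""
    values = [KD[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def hydrophobic_windows(protein, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean exceeds the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= threshold]

# Invented peptide: a 19-residue leucine stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(hydrophobic_windows(toy))
```

Windows that overlap the leucine stretch sufficiently are flagged, while the charged flanks are not; HMM-based predictors additionally model helix length and the orientation of the loops.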

Slide 50: Ontologies
- Use of ontologies to annotate gene products
- Gene Ontology (GO)
  - Cellular component
  - Molecular function
  - Biological process
- Sequence Ontology (SO)

Slide 51: Other data to look at
- Enzyme classification (EC) numbers
- Phenotype information
  - Alleles
  - Gene knockouts
  - RNAi knockdowns
- Expression data
  - EST libraries (source of RNA material)
  - Microarrays
  - SAGE tags

Slide 52: Functional assignment
- The assignment of a function to a gene product can be made by a human curator by assessing all of the data (similarities, protein domains, signal peptide etc.)
- This is a labour-intensive process and, like gene prediction, is subjective
- There are automated approaches (based on family and domain databases such as Panther or InterPro) but these are under-developed
- A large number of predictions from a genome project remain 'hypothetical protein' or 'conserved hypothetical protein'

Slide 53: Caveats to genome annotation
- Annotation accuracy is only as good as the available supporting data at the time of annotation
  - Gene predictions will change over time as new data becomes available (ESTs, related genomes)
  - Functional assignments will change over time as new data becomes available (characterisation of hypothetical proteins)
- Gene predictions are 'best guess'
- Functional annotations are not definitive and only a guide
- If you want the annotation to improve you should get involved with whoever is sequencing (or has sequenced) your genome of interest
- For vectors you can mail with suggestions and