MAKER Annotation Process: Example of Glossina
VectorBase
Karyn Mégy, Dan Hughes


VectorBase Hinxton Developer Meeting February 2012

Annotation: aims and means

Aims
- Preliminary
- Locus rather than exact position

Means
- Automatic annotation
  - By similarity
  - Ab initio
- Manual annotation
  - By regions
  - By gene families

Annotation: similarity vs. ab initio

Similarity
- Similarity to known sequences
  -> finds only known genes
  -> depends on the available data (quantity, quality)

Ab initio
- Follows a gene "recipe"
  -> can potentially identify new genes
  -> tends to over-predict

Ensembl annotation

Raw genome -> Masked genome
Masking: RepeatModeler repeats + known repeats/transposons

Evidence layers (numbered as on the slide):
1. Community annotation
2. Protein, species specific
3. Transcriptome, species specific
4. Protein, 'close' species
5. Ab initio

Followed by:
- Add UTRs (ESTs)
- Functional annotation
- ncRNA prediction
- Pseudogene prediction
- GenBank submission
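One way to read the numbered evidence layers on the Ensembl annotation slide (community annotation, species-specific proteins, species-specific transcriptome, 'close' species proteins, ab initio) is as a priority order: a model from a lower layer is kept only where no higher layer already covers the locus. The slide does not spell this rule out, so the sketch below is an assumption for illustration only. The `Model` type, the overlap test and the coordinates are hypothetical, and it is not the Ensembl pipeline's actual code.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """A gene model reduced to its genomic span (hypothetical, illustrative)."""
    seq_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    source: str


def overlaps(a: Model, b: Model) -> bool:
    """True if both models lie on the same sequence and their spans intersect."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


def layer_models(layers: List[List[Model]]) -> List[Model]:
    """Merge layers given highest priority first.

    A model is kept only if it does not overlap a model kept from a
    higher-priority layer. Models within one layer never exclude each other.
    """
    kept: List[Model] = []
    for layer in layers:
        higher = list(kept)  # snapshot: compare against earlier layers only
        for model in layer:
            if not any(overlaps(model, k) for k in higher):
                kept.append(model)
    return kept


layers = [
    [Model("scaf1", 100, 500, "community")],
    [Model("scaf1", 400, 900, "protein"), Model("scaf1", 1000, 1500, "protein")],
    [Model("scaf2", 100, 200, "ab_initio"), Model("scaf1", 1200, 1300, "ab_initio")],
]
result = layer_models(layers)
```

In this toy input the protein model at 400-900 is dropped because it overlaps the community model, and the ab initio model at 1200-1300 is dropped because a protein-based model already covers that locus.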

Ensembl annotation

- Similarity-focused
- Data-rich organisms
- Fiddly, time consuming
- Rhodnius prolixus experience
- In the meantime: Heliconius annotation using MAKER

MAKER

Aim:
- Generate gene sets
- Combine into final gene set

Iterative process

Raw genome + data -> Annotated genome

Cantarel et al. Gen. Res PMID

Intermediate gene sets

Masking: RepeatModeler repeats + known repeats/transposons
Raw genome -> Masked genome

Raw data
- ESTs
  - from GenBank
  - cleaned and clustered/assembled with CAP3
  - 71,700 contigs
- Insecta/Metazoa proteins
  - from UniProt
  - aligned to the genome with BLAST
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)
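The masking step named on this slide (raw genome to masked genome) can be illustrated with a minimal sketch. The slides do not say whether hard or soft masking was used, so both are shown. The repeat intervals here are hypothetical hand-written coordinates and are not parsed from RepeatModeler or any other tool's output.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def hard_mask(sequence: str, repeats: Iterable[Interval]) -> str:
    """Replace every base inside a repeat interval with 'N'."""
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so bad coordinates cannot raise.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


def soft_mask(sequence: str, repeats: Iterable[Interval]) -> str:
    """Lower-case every base inside a repeat interval, upper-case the rest."""
    masked = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


genome = "ACGTACGTAC"
repeat_intervals = [(2, 5)]  # hypothetical repeat annotation
hard = hard_mask(genome, repeat_intervals)
soft = soft_mask(genome, repeat_intervals)
```

Hard masking hides repeats from every downstream aligner and predictor, while soft masking keeps the bases available to tools that can choose to read lower-case sequence.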

Intermediate gene sets

Masking: RepeatModeler repeats + known repeats/transposons
Raw genome -> Masked genome

Gene models
- RNAseq Illumina Yale
  - cleaned
  - aligned to the genome using Tophat/Bowtie
  - 'transfrags' built with Cufflinks
  - 78,000 'transfrags' (on 4 sets -> overlaps)
- Augustus
  - generated by Martin Swain
  - trained with SOLiD data
  - 16,963 models, high quality
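The note that the 78,000 transfrags come from 4 sets and therefore overlap can be made concrete with a small interval-merging sketch. This is only an illustration of why pooled sets are redundant. It ignores strand and exon structure, the coordinates are made up, and it is not a description of how Cufflinks or MAKER handle the redundancy.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (sequence id, start, end), 1-based inclusive


def merge_spans(spans: List[Span]) -> List[Span]:
    """Collapse overlapping spans on the same sequence into single loci."""
    merged: List[Span] = []
    for seq_id, start, end in sorted(spans):
        if merged and merged[-1][0] == seq_id and start <= merged[-1][2]:
            # Overlaps the previous locus: extend it instead of adding a new one.
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            merged.append((seq_id, start, end))
    return merged


# Hypothetical transfrags pooled from several RNA-seq sets.
pooled = [
    ("s1", 100, 200),
    ("s1", 150, 300),
    ("s1", 400, 500),
    ("s2", 100, 200),
    ("s1", 180, 250),
]
loci = merge_spans(pooled)
```

Five pooled transfrags collapse to three loci here, which is the kind of reduction to expect when several sets cover the same genes.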

Intermediate gene sets

Masking: RepeatModeler repeats + known repeats/transposons
Raw genome -> Masked genome

Raw data
- ESTs: aligned to the genome
  - from GenBank, clustered with CAP3
  - 71,700 clusters
- Insecta/Metazoa proteins (UniProt)
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)

Gene models
- RNAseq Illumina Yale: using Tophat/Cufflinks
  - 78,000 'transfrags' (on 4 sets -> overlaps)
- Augustus: trained with SOLiD data
  - 16,963 models, high QC

Ab initio
- SNAP: trained for Glossina (MAKER)
- Augustus: trained for Glossina (Martin Swain)
- GenScan

MAKER

Masking: RepeatModeler repeats + known repeats/transposons
Raw genome -> Masked genome

Diagram inputs: Raw data (ESTs, Proteins), Ab initio, Gene models
Legend: provided as input / run software within MAKER

MAKER: iterative process

Round 1:
- Align ESTs and Insecta proteins to the genome
- Train SNAP (1): Drosophila HMM; ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)

Round 2:
- Re-train SNAP (2): same as above, but HMM = output of SNAP-1

Round 3:
- Re-train SNAP (3): same as above, but HMM = output of SNAP-2
- Align Metazoa proteins to the genome
- Combine final gene set
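The three rounds above follow a simple pattern: each round's SNAP training starts from the HMM produced by the previous round, with a Drosophila HMM as the seed. The sketch below only builds that plan as data. It does not call MAKER or SNAP, and the HMM file names are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Round:
    number: int
    snap_hmm: str                    # HMM that SNAP starts from in this round
    new_alignments: Tuple[str, ...]  # evidence newly aligned in this round
    combine_final: bool              # whether the final gene set is combined


def plan_rounds(n_rounds: int = 3, seed_hmm: str = "drosophila.hmm") -> List[Round]:
    """Build the round plan: round n uses the HMM trained in round n-1."""
    rounds: List[Round] = []
    hmm = seed_hmm
    for n in range(1, n_rounds + 1):
        is_first = n == 1
        is_last = n == n_rounds
        if is_first:
            alignments: Tuple[str, ...] = ("ESTs", "Insecta proteins")
        elif is_last:
            alignments = ("Metazoa proteins",)
        else:
            alignments = ()
        rounds.append(Round(n, hmm, alignments, is_last))
        hmm = f"snap_round{n}.hmm"  # hypothetical name for this round's output
    return rounds


rounds = plan_rounds()
for r in rounds:
    print(r.number, r.snap_hmm, r.new_alignments, r.combine_final)
```

Keeping the plan as plain data makes the dependency between rounds explicit before any long-running job is launched.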

Using MAKER for…

- Heliconius
- Tsetse fly
- Salmon louse
- Centipede

Annex…

Augustus (SOLiD)

Martin Swain's stats, July 22nd, 2011

Glossina trained:
-> ESTs only: 14,739 predictions, 9.8% with similarity to Gl. proteins (1,455 seq., 95% seq. identity)
-> ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Gl. proteins (1,465 seq., 95% seq. identity)
-> Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models

Glossina un-trained:
-> 8,581 predictions, 15% with similarity to Gl. proteins (1,299 seq., exact matches)
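The percentages on this slide can be rechecked from the counts it gives. The short calculation below uses only the numbers quoted above. The dictionary labels are my own shorthand, not terms from the slide.

```python
def percent(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# (sequences with similarity to Glossina proteins, total predictions)
augustus_stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "untrained": (1299, 8581),
}

shares = {label: percent(hits, total) for label, (hits, total) in augustus_stats.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2f}%")

# 53% of the 2,754 Glossina GenBank proteins matched Augustus models.
matching_proteins = round(0.53 * 2754)
print("GenBank proteins matching Augustus models (approx.):", matching_proteins)
```

The recomputed shares agree with the slide's 9.8%, 9.9% and 15% to within rounding or truncation of the last digit.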

ESTs

Total: 79,292 ESTs

[1] Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. Lehane et al.
[2] Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. Lehane et al.
[3] Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. Attardo et al.
[4] Functional characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. …, Lehane, M., Hertz-Fowler, C., Berriman, M., …
[5] Comprehensive analysis of the transcriptome of the tsetse fly Glossina morsitans morsitans. Unpublished. Hertz-Fowler, C., Aslett, M.A. and Berriman, M.

ESTs submitted under: GenomeProject:9563

MAKER: final gene set

Genes
- Final genes: 12,220
- Raw data
  - EST-based genes: 23,469
  - Protein-based genes: 416,959 (about 417,000; redundancy)
- Gene sets
  - Illumina-Yale: 70,915 (redundancy)
  - Augustus (SOLiD): 16,155
- Ab initio
  - SNAP: 48,464
  - Augustus (MAKER): 14,413