Published by Gwen Morrison. Modified over 9 years ago.
1
MAKER Annotation Process Example of Glossina VectorBase http://www.vectorbase.org Karyn Mégy Dan Hughes
2
VectorBase http://www.vectorbase.org Hinxton Developer Meeting February 2012
Annotation: aims and means
Aims
– Preliminary
– Locus rather than exact position
Means
– Automatic annotation
   - By similarity
   - Ab initio
– Manual annotation
   - By regions
   - By gene families
3
Annotation: similarity vs. ab initio
Similarity
– Similarity to known sequences
   -> finds only known genes
   -> depends on the available data (quantity, quality)
Ab initio
– Follows a gene “recipe”
   -> can potentially identify new genes
   -> prone to over-prediction
4
Ensembl annotation
[Figure: Ensembl annotation pipeline]
Raw genome sequence -> masked genome sequence
Masking: RepeatModeler repeats + known repeats/transposons
Evidence sources, numbered as on the slide:
1. Community annotation
2. Protein, species specific
3. Transcriptome, species specific
4. Protein, ‘close’ species
5. Ab initio
Followed by: add UTRs (ESTs), functional annotation, ncRNA prediction, pseudogene prediction, GenBank submission
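If the numbers 1 to 5 on the slide are read as a priority order (an assumption; the slide does not spell this out), the rule at each locus is to keep the model from the best-ranked evidence source available. The sketch below illustrates that reading only. The label strings, the locus names and the data layout are hypothetical and are not taken from the Ensembl pipeline.

```python
# Ranks follow the numbering on the slide: lower number = preferred evidence.
# The label strings are illustrative, not Ensembl identifiers.
PRIORITY = {
    "community_annotation": 1,
    "protein_species_specific": 2,
    "transcriptome_species_specific": 3,
    "protein_close_species": 4,
    "ab_initio": 5,
}


def pick_best_source(sources):
    """Return the best-ranked evidence source available at a locus, or None."""
    known = [s for s in sources if s in PRIORITY]
    if not known:
        return None
    return min(known, key=PRIORITY.get)


# Hypothetical loci and the evidence types that support each one.
loci = {
    "locus_A": ["ab_initio", "protein_close_species"],
    "locus_B": ["transcriptome_species_specific", "community_annotation"],
    "locus_C": ["ab_initio"],
}

chosen = {locus: pick_best_source(srcs) for locus, srcs in loci.items()}
for locus, source in chosen.items():
    print(locus, "->", source)
```

Ab initio predictions are used only where nothing better exists, which matches their last place in the numbering.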
5
Ensembl annotation
– Similarity-focused
– Data-rich organisms
– Fiddly, time consuming
– Rhodnius prolixus experience
In the meantime: Heliconius annotation using MAKER
6
MAKER
Aim:
– Generate gene sets
– Combine into final gene set
Iterative process
http://www.yandell-lab.org/software/maker.html
Cantarel et al. Genome Res. 2008. PMID 18025269
[Figure: raw genome + data -> annotated genome]
8
Intermediate gene sets
Masking: RepeatModeler repeats + known repeats/transposons (raw genome -> masked genome)
Raw data
- ESTs
   - from GenBank
   - cleaned and clustered/assembled with CAP3
   - 71,700 contigs
- Insecta/metazoa proteins
   - from UniProt
   - aligned to the genome with BLAST
   - 690,000 sequences (insecta)
   - 2,200,000 sequences (metazoa)
9
Intermediate gene sets
Masking: RepeatModeler repeats + known repeats/transposons (raw genome -> masked genome)
Gene models
- RNAseq Illumina Yale
   - cleaned
   - aligned to the genome using TopHat/Bowtie
   - ‘transfrags’ built with Cufflinks
   - 78,000 ‘transfrags’ (on 4 sets -> overlaps)
- Augustus
   - generated by Martin Swain
   - trained with SOLiD data
   - 16,963 models, high quality
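The RNAseq step above (align with TopHat/Bowtie, then build transfrags with Cufflinks) can be sketched as two chained command lines. The options actually used for Glossina are not given on the slide, so the sketch below uses only the basic invocation of each tool. The index prefix, file names and sample name are hypothetical placeholders. The code only builds and prints the commands; it does not run them.

```python
def tophat_command(bowtie_index, reads_fastq, out_dir):
    """TopHat takes the Bowtie index prefix followed by the reads; -o sets the output directory."""
    return ["tophat", "-o", out_dir, bowtie_index, reads_fastq]


def cufflinks_command(alignments_bam, out_dir):
    """Cufflinks assembles transfrags from the alignment file; -o sets the output directory."""
    return ["cufflinks", "-o", out_dir, alignments_bam]


def transfrag_pipeline(bowtie_index, reads_fastq, sample):
    """Return the two commands for one read set, in the order they would be run."""
    tophat_out = f"{sample}_tophat"
    cufflinks_out = f"{sample}_cufflinks"
    return [
        tophat_command(bowtie_index, reads_fastq, tophat_out),
        # TopHat writes its alignments to accepted_hits.bam in its output directory.
        cufflinks_command(f"{tophat_out}/accepted_hits.bam", cufflinks_out),
    ]


# Hypothetical index prefix and read file, for illustration only.
commands = transfrag_pipeline("glossina_index", "sample1.fastq", "sample1")
for cmd in commands:
    print(" ".join(cmd))
```

Running the same pipeline once per read set would give one transfrag set per run, and sets built this way overlap each other, as the "on 4 sets -> overlaps" remark notes.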
10
Intermediate gene sets
Masking: RepeatModeler repeats + known repeats/transposons (raw genome -> masked genome)
Raw data
- ESTs, aligned to the genome
   - from GenBank, clustered with CAP3
   - 71,700 clusters
- Insecta/metazoa proteins (UniProt)
   - 690,000 sequences (insecta)
   - 2,200,000 sequences (metazoa)
Gene models
- RNAseq Illumina Yale, using TopHat/Cufflinks
   - 78,000 ‘transfrags’ (on 4 sets -> overlaps)
- Augustus, trained with SOLiD data
   - 16,963 models, high QC
Ab initio
- SNAP, trained for Glossina (MAKER)
- Augustus, trained for Glossina (Martin Swain)
- GenScan
12
MAKER
Masking: RepeatModeler repeats + known repeats/transposons (raw genome -> masked genome)
[Figure: MAKER inputs. Raw data (ESTs, proteins), ab initio predictions and gene models, each marked as either “provided as input” or “run software within MAKER”]
13
MAKER – iterative process
Round-1:
– Align ESTs and Insecta proteins to the genome
– Train SNAP (1):
   - Drosophila HMM
   - ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)
Round-2:
– Re-train SNAP (2): same as above, but HMM = output of SNAP-1
Round-3:
– Re-train SNAP (3): same as above, but HMM = output of SNAP-2
– Align Metazoa proteins to the genome
– Combine final gene set
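The three rounds follow a simple bootstrap pattern: the evidence stays the same while the SNAP HMM fed into each round is the one trained in the previous round. The sketch below models only that bookkeeping. The function name, dictionary keys and the "SNAP-n" labels are hypothetical, and no MAKER command or option is implied.

```python
def plan_rounds(n_rounds=3, initial_hmm="Drosophila HMM"):
    """Describe each round: which SNAP HMM goes in and which trained HMM comes out."""
    evidence = [
        "EST and protein alignments",
        "RNA-seq Illumina Yale",
        "Augustus (SOLiD)",
    ]
    rounds = []
    hmm_in = initial_hmm
    for i in range(1, n_rounds + 1):
        hmm_out = f"SNAP-{i}"
        extra_steps = []
        if i == 1:
            extra_steps.append("align ESTs and Insecta proteins to the genome")
        if i == n_rounds:
            extra_steps.append("align Metazoa proteins to the genome")
            extra_steps.append("combine final gene set")
        rounds.append({
            "round": i,
            "snap_hmm_in": hmm_in,
            "snap_hmm_out": hmm_out,
            "evidence": list(evidence),
            "extra_steps": extra_steps,
        })
        # The HMM trained in this round seeds the next one.
        hmm_in = hmm_out
    return rounds


plan = plan_rounds()
for r in plan:
    print(f"Round {r['round']}: {r['snap_hmm_in']} -> {r['snap_hmm_out']}; {r['extra_steps']}")
```

Starting from a Drosophila HMM gives SNAP a reasonable first guess for a related dipteran, and each retraining round moves the model closer to Glossina-specific gene structure.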
14
Using MAKER for…
– Heliconius
– Tsetse fly
– Salmon louse
– Centipede
15
Annex…
16
Augustus (SOLiD)
Martin Swain’s stats, July 22nd, 2011
Glossina trained:
-> ESTs only: 14,739 predictions, 9.8% with similarity to Gl. proteins (1,455 seq., 95% seq. identity)
-> ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Gl. proteins (1,465 seq., 95% seq. identity)
-> Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models
Glossina un-trained:
-> 8,581 predictions, 15% with similarity to Gl. proteins (1,299 seq., exact matches)
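The percentages above can be rechecked from the raw counts. The short sketch below assumes that each "seq." figure is the number of predictions with a protein match, which the slide does not state outright. The dictionary labels are shorthand, not the author's wording.

```python
def pct(part, total):
    """Percentage of `part` in `total`."""
    return 100.0 * part / total


# (matching sequences, total predictions), copied from the slide.
stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}

for label, (hits, total) in stats.items():
    print(f"{label}: {pct(hits, total):.2f}% of predictions")

# 53% of the 2,754 Glossina GenBank proteins, as an absolute count.
matched_proteins = 0.53 * 2754
print(f"GenBank proteins matching Augustus models: about {matched_proteins:.0f}")
```

Under that assumption the first ratio comes out nearer 9.9% than the 9.8% quoted, so the slide may have truncated rather than rounded. The count implied by 53% of 2,754 proteins (about 1,460) falls between the two trained match counts of 1,455 and 1,465, which suggests the figures describe the same set of matches from both directions.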
17
ESTs
Total: 79,292 ESTs
18
[1] Lehane et al. Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003.
[2] Lehane et al. Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008.
[3] Attardo et al. Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006.
[4] …., Lehane,M., Hertz-Fowler,C., Berriman,M., … Functional characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009.
[5] Hertz-Fowler,C., Aslett,M.A. and Berriman,M. Comprehensive analysis of the transcriptome of the tsetse fly Glossina morsitans morsitans. Unpublished. 2009.
ESTs submitted under: GenomeProject:9563
19
MAKER – final gene set
Genes:
– Final genes: 12,220
– Raw data:
   - EST-based genes: 23,469
   - Protein-based genes: 416,9591 (redundancy)
– Gene sets:
   - Illumina-Yale: 70,915 (redundancy)
   - Augustus (SOLiD): 16,155
– Ab initio:
   - SNAP: 48,464
   - Augustus (MAKER): 14,413
(417,000)
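The "(redundancy)" notes explain why the evidence counts are so much larger than the 12,220 final genes: many alignments and transfrags overlap the same locus. The sketch below shows that effect by collapsing overlapping intervals into loci. It is a generic interval merge and is not how MAKER combines evidence. It ignores strand, and the coordinates and scaffold names are made up.

```python
def merge_into_loci(models):
    """Collapse overlapping (scaffold, start, end) intervals into non-overlapping loci."""
    loci = []
    # Sorting tuples orders by scaffold, then start, then end.
    for scaffold, start, end in sorted(models):
        if loci and loci[-1][0] == scaffold and start <= loci[-1][2]:
            # Overlaps the previous locus on the same scaffold: extend it.
            prev = loci[-1]
            loci[-1] = (scaffold, prev[1], max(prev[2], end))
        else:
            loci.append((scaffold, start, end))
    return loci


# Hypothetical redundant models: the first two overlap and form a single locus.
models = [
    ("scaffold1", 100, 500),
    ("scaffold1", 400, 900),
    ("scaffold1", 1000, 1200),
    ("scaffold2", 100, 300),
]

loci = merge_into_loci(models)
print(f"{len(models)} models -> {len(loci)} loci")
```

Counting loci rather than individual models is also in line with the stated aim of annotating the "locus rather than exact position".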