VectorBase
Frank Collins, Scott Emrich, Dan Lawson, Greg Madey
BRC PI/PM Meeting, Bethesda, MD, April 27, 2012
Genome Sizes
  Pediculus humanus: ~110 Mb, N50 = 488 kb
  Anopheles gambiae S: ~260 Mb, N50 = 1,505 kb
  Culex quinquefasciatus: ~580 Mb, N50 = 487 kb
  Aedes aegypti: ~1.3 Gb, N50 = 1,500 kb
  Ixodes scapularis: ~1.8 Gb, N50 = 72 kb
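The N50 values quoted for each assembly can be illustrated with a minimal sketch. It assumes the usual definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy scaffold lengths are hypothetical and are not VectorBase data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Assumes the common definition: sort lengths from longest to shortest
    and return the length at which the cumulative sum first reaches half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, lengths in kb (illustration only).
toy_scaffolds_kb = [800, 500, 300, 200, 100, 100]
print(n50(toy_scaffolds_kb))  # 500: 800 + 500 = 1300 >= 1000 (half of 2000)
```

A low N50 relative to genome size, as for Ixodes scapularis above, indicates that the assembly is split into many short scaffolds.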
Future genomes
4 White papers
  Sandflies: Lutzomyia longipalpis, Phlebotomus papatasi
  Anopheles (AGCC): Anopheles arabiensis, Anopheles quadriannulatus, Anopheles merus, Anopheles melas, Anopheles christyi, Anopheles epiroticus, Anopheles stephensi, Anopheles maculatus, Anopheles funestus, Anopheles minimus, Anopheles culicifacies, Anopheles farauti, Anopheles dirus, Anopheles atroparvus, Anopheles albimanus
  Glossina: Glossina palpalis, Glossina fuscipes, Glossina pallidipes, Glossina brevipalpis, Glossina austeni, Stomoxys calcitrans, Musca domestica
  Simulium: Simulium vittatum, Simulium sirbanum, Simulium damnosum, Simulium ochraceum, Simulium squamosum, Simulium thyolense, Simulium santipauli, Simulium woodi, Simulium exiguum, Simulium yahense
Ticks & Mites: Leptotrombidium deliense, Ixodes scapularis*, Dermacentor variabilis, Ornithodoros turicata
Anopheles: Anopheles darlingi*, Anopheles stephensi
Others
  Aedes: Aedes albopictus
  i5K initiative
First New Release in New Contract
Challenges of vector genomes
  Relatively large genomes from species that are hard to inbreed
  Heterozygosity in sequencing samples (up to 80 different males were sequenced for the new gambiae genomes) causes dubious scaffolds
  Inversions and heterochromatic regions induce gaps
  Newer-generation sequencing has reduced cost but has not yet maintained overall quality
  Non-trivial annotations
An. gambiae forms
  M-form: more permanent; available year-round; allows slower development; predator-rich
  S-form: ephemeral, rainy-season dependent; requires rapid development; largely predator-free
Figure: divergence across chromosome arms 2L, 2R, X, 3R, 3L (C. Cheng et al., unpublished)
Optical mapping DBP: Wisconsin
Size matters
  Genome: MB optically mapped, genes found
  S Sanger: 145,
  S Illumina: 58,
  PEST: 60,
  Sanger + Ill: 204,
Annotation strategies
Speeding up computational annotation
  Use of MAKER system
  Prediction by projection from ‘high quality’ reference
  Expanded use of RNA-Seq: Scripture, Trinity & Cufflinks/Bowtie
Community engagement
  Primarily deployed for new genomes (Glossina, Rhodnius)
  Works for all other VectorBase genomes
De novo annotation
MAKER with RNA-Seq & reference proteomes
Aim: gene prediction pipeline for the masses
  Used for a number of arthropod genome projects
  Touted as the default pipeline for many more (part of the GMOD toolkit)
Overview
  Ab initio gene predictions from SNAP, Augustus & FGENESH
  Final gene models from MAKER
  EST alignments from both EXONERATE and BLASTN
  Protein alignments from EXONERATE and BLASTX
  Repeats from RepeatFinder & RepeatMasker
  Additional data sets integrated via GFF3 files (RNA-Seq)
  Uses MPI for parallelization over a compute farm
  Optimization for long scaffolds
Summary
  Iterative runs give acceptable reference gene sets
  Used for Glossina and An. stephensi
  Used by others for Strigamia, Manduca, published ant genomes
Community annotation
Simple tool to capture community annotation
  Makes gene prediction and evidence available as GFF3
  Compatible with Artemis and Apollo tools
Submissions in GFF3 format
  Gene structure corrections
  Gene metadata (symbol, description, citations)
Glossina annotation effort (Nov 11 – Apr 12)
  790 GFF submissions
  2670 items of metadata (gene symbols, descriptions)
  Structure confirmation
Figure: community annotation workflow. Panel labels: GFF3, FASTA, ARTEMIS, APOLLO, Identify gene, Modify model, Submit, CAP.
Example GFF3 evidence rows shown on the slide:
  scf ptn2genome ptn_match ID=xxxx;Name=tr|Q3UIQ2|
  scf ptn2genome ptn_match ID=xxxx2;Name=tr|Q3TIU7|
  scf ptn2genome ptn_match ID=xxxx3;Name=sp|Q91VD9|
  scf ptn2genome ptn_match ID=xxxx2;Name=tr|Q3VIU732|
Example FASTA shown on the slide:
  >MY SUPERCONTIG
  ATATATGCGTTGAGCTGCGTTACGTTCGG
  GATGCGTTAGGCTTGTGAGCTGGATCGGT
  CCTGCCTGCGTCGATATAAACGACCT…
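As a rough illustration of what a GFF3 evidence row such as the `ptn_match` examples carries, the sketch below parses one nine-column GFF3 line into a dictionary. It follows the standard GFF3 layout (tab-separated columns, `key=value` attribute pairs separated by semicolons). The function names and the coordinates in the sample line are hypothetical, since the slide shows only the sequence, source, type, and attribute columns, and this is not the code used by the VectorBase submission tool.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters in keys and values
        attributes[unquote(key)] = unquote(value)
    return attributes


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": parse_gff3_attributes(attrs),
    }


# Sample row: coordinates (100-400), score, strand and phase are made up.
sample = "scf\tptn2genome\tptn_match\t100\t400\t.\t+\t.\tID=xxxx;Name=tr|Q3UIQ2|"
record = parse_gff3_line(sample)
print(record["type"], record["attributes"]["Name"])
```

Reading rows this way makes it easy to check that each submitted feature has an `ID` and the expected metadata before it is loaded into Artemis or Apollo.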
Population biology
Chado Natural diversity schema
  183 projects, samples
  Incorporates IRbase samples
Ensembl variation schema
  1,511,335 SNP calls
  Visualization through browser
  Data downloads through browser
  Queries via BioMart interface