Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) iPlant: Josh Stein (CSHL) Matt Vaughn (TACC) Dian Jiao (TACC) Zhenyuan Lu (CSHL) Nirav Merchant (U. Arizona) Carson Holt (Ontario Institute Cancer Research) Campbell et al. Plant Physiology. DOI: /pp Cantarel et al Genome Research 18:188 Holt & Yandell BMC Bioinformatics 12:491
Assembly & Annotation at iPlant
What Are Annotations? Annotations are descriptions of features of the genome Structural: exons, introns, UTRs, splice forms etc. Coding & non-coding genes Expression, repeats, transposons Annotations should include evidence trail Assists in quality control of genome annotations Examples of evidence supporting a structural annotation: Ab initio gene predictions ESTs Protein homology
Secondary Annotation Protein Domains InterPro Scan: combines many HMM databases GO and other ontologies Pathway mapping E.g. BioCyc Pathway tools
Challenges in Plant Genome Annotation Genomes are BIG Highly repetitive Many pseudogenes Assembly contamination Incomplete evidence No method is 100% accurate
Options for Protein-coding Gene Annotation Yandell & Ence. Nature Reviews Genetics 13, (May 2012) | doi: /nrg3174
Typical Annotation Pipeline Contamination screening Repeat/TE masking Ab initio prediction Evidence alignment (cDNA, EST, RNA-seq, protein) Evidence-driven prediction Chooser/combiner Evaluation/filtering Manual curation
MAKER-P Automated Pipeline Ab initio prediction Evidence MPI-enabled to allow parallel operation on large compute clusters
Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED). Better Quality Worse
W559 - Annotation of the Lobolly Pine Megagenome—Jill Wegrzyn Gb assembly—split into 40 jobs—216 CPU/job (8640 CPU total)—17 hours P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species 10 rice species (each w/12 chromosome pseudomolecules) 96 CPU per chromosome (1152 CPU total) ~ 2hr per genome 10 22,656 CPU cores on1,888 nodes GenomeAssembly Size (Mb) CPU Run Time Arabidopsis thalianaTAIR :44 Arabidopsis thalianaTAIR :27 Zea maysRefGen_v :53 TACC Lonestar Supercomputer Campbell et al. Plant Physiology. December 4, 2013, DOI: /pp PAG 2014: MAKER-P at iPlant
Virtual image MPI-enabled for parallel computing Check out with up to 16 CPU Tested with 4 CPU instance (rice chr 1 in 8 hr 45 min) Tutorial: sciplant/MAKER-P+Tutorial+%28Atmosphere%29 11 Atmosphere: MAKER_2.28 (emi-F13821D0)
Future Plans TACC Lonestar Supercomputer Discovery Environment GUI