Assembly & Annotation at iPlant


1 Assembly & Annotation at iPlant
Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona), Michael Schatz (CSHL), Roger Barthelson (CSHL)
Cantarel et al. Genome Research 18:188
Holt & Yandell. BMC Bioinformatics 12:491

2 Maize Genome Project
3-yr NSF funded project, $30 M, 9 PIs
Genome: 2500 Mb, 10 chromosomes, 50,000 genes
Strategy: BAC-by-BAC, 17,000 clones, finish genic regions
Mapping (U. Arizona): FPC map, min. tiling path, BAC selection
Sequencing (Washington U.): 6X shotgun, auto finish, manual finishing, GenBank
Annotation (CSHL): repeat analysis, gene prediction, database, browser (Maizesequence.org)

3 Technology Lowering Barriers

4 Assembly & Annotation at iPlant

5 Complexity of Genomes
Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534.

6 Assembling a Genome
1. Shear & Sequence DNA
2. Construct assembly graph from overlapping reads
…AGCCTAGGGATGCGCGACACGT
GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
CAACCTCGGACGGACCTCAGCGAA…
3. Simplify assembly graph
4. Detangle graph with long reads, mates, and other links
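Step 2 can be illustrated with the three read fragments shown on this slide. The following is only a minimal sketch of suffix-prefix overlap detection, not how any production assembler builds its graph. The function names and the `min_len` cutoff are assumptions made for illustration.

```python
from itertools import permutations


def longest_overlap(a: str, b: str, min_len: int = 10) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap is at least `min_len` bases long."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len: int = 10):
    """Build a directed overlap graph as {(i, j): overlap_length}."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            edges[(i, j)] = k
    return edges


# The three read fragments from the slide (ellipses dropped).
reads = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]
edges = overlap_graph(reads)
print(edges)
```

The all-pairs comparison is quadratic in the number of reads, so it only works for toy inputs. Real assemblers rely on k-mer indexes or suffix structures to find candidate overlaps.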

7 Ingredients for a good assembly
Coverage: High coverage is required
Oversample the genome to ensure every base is sequenced with long overlaps between reads
Biased coverage will also fragment assembly
Amount of oversampling depends on read length, genome complexity
[Figure: Expected Contig Length vs. Read Coverage]
Read Length: Reads & mates must be longer than the repeats
Short reads will have false overlaps forming hairball assembly graphs
With long enough reads, assemble entire chromosomes into contigs
Quality: Errors obscure overlaps
Reads are assembled by finding kmers shared in pairs of reads
High error rate requires very short seeds, increasing complexity and forming assembly hairballs
Current challenges in de novo plant genome sequencing and assembly. Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology 13:243
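The oversampling point can be made concrete with the usual idealized model. This sketch assumes reads are sampled uniformly at random, so per-base depth is approximately Poisson (the Lander-Waterman setting). Real libraries have the biased coverage the slide warns about, so treat the result as a lower bound on the sequencing needed. The function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_len / genome_size


def fraction_uncovered(c: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson depth
    with mean c (uniform random sampling; an idealization)."""
    return math.exp(-c)


# Example: 1 million 100 bp reads over a 10 Mbp genome.
c = coverage(1_000_000, 100, 10_000_000)
print(f"coverage = {c:.1f}X, expected uncovered fraction = {fraction_uncovered(c):.2e}")
```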

8 N50 size
Def: 50% of the genome is in contigs at least as large as the N50 value
Example: 1 Mbp genome with contigs of 300, 100, 45, 45, 30, 20, 15, 15, 10, ... kbp
N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500 kbp)
Note: N50 values are only meaningful to compare when base genome size is the same in all cases
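The example on this slide can be reproduced in a few lines. This is a minimal sketch; the function name and the optional `genome_size` argument are choices made here. When a known genome size is passed instead of the assembly total, the statistic is often called NG50.

```python
def n50(contig_lengths, genome_size=None):
    """Return the length L such that contigs of length >= L together
    cover at least half of the total.

    If genome_size is given, half of that value is the target (as in the
    slide's 1 Mbp example); otherwise half of the summed contig lengths.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # contigs never reach half of the target size


# Contig sizes in kbp from the slide, against a 1000 kbp (1 Mbp) genome.
contigs_kbp = [300, 100, 45, 45, 30, 20, 15, 15, 10]
print(n50(contigs_kbp, genome_size=1000))
```

The running sum passes 500 kbp at the fifth contig (300 + 100 + 45 + 45 + 30 = 520), which gives the 30 kbp stated on the slide.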

9 Attempt to answer the question: “What makes a good assembly?”
Organizers provided sequence data to assembly experts around the world
Assemblathon 1: ~100 Mbp simulated genome
Assemblathon 2: 3 vertebrate genomes, each ~1 Gbp
Results demonstrate trade-offs assemblers must make
Organized by UC Davis and UC Santa Cruz
“good framing problem”
Assemblathon 1: A competitive assessment of de novo short read assembly methods. Earl DA, et al. (2011) Genome Research. doi: /gr
Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. Bradnam KR, et al. (2013) GigaScience 2:10. doi: / X-2-10

10 Final Rankings
ALLPATHS and SOAPdenovo came out neck-and-neck, followed closely by Celera Assembler, SGA, and ABySS
My recommendation for “typical” short read assembly is to use ALLPATHS
Single molecule sequencing is becoming extremely attractive if you have access

11 Apps in Discovery Environment
Genome Assembly: Allpaths-LG, Soapdenovo2, ABySS, Velvet, Newbler, Ray
Contig analysis tools: With or without reference sequence for comparison

12 Assembly Workflow
An unfamiliar problem with familiar data
Upload Reads: Minutes to Months
Quality Assessment: Minutes to Hours
De novo Assembly: Hours to Days
Assembly Assessment: Minutes to Hours

13 Apps in Discovery Environment
(for sequencing studies)
Sequence Quality Control:
FastQC
Fastx Toolkit
Suffixerator/Tallymer/mkindex
Sabre, Scythe, Sickle (paired end trimming)
SGA cleanup (paired end quality trimming)
Future plans:
Sequence induction, assessment, and trimming pipeline
Mira contaminant detection and removal

14 QC: FastQC

15 QC: Read Coverage
[Figure labels: Reference, Reads, Errors, Coverage, Repeats]

16 Wheat Genome (Aegilops tauschii / CSHL)

17 QC: Mer counts
Inputs: Frag1.fq, Frag2.fq
Apps: FASTX_fastq-to-fasta, Suffixerator, Suffixerator-Tallymer-mkindex
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. Kurtz S, Narechania A, Stein JC, Ware D. (2008) BMC Genomics 9:517
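To show what a mer-count QC step measures, here is a naive hash-based k-mer counter. It is not Tallymer's method, which builds an enhanced suffix array and scales to whole plant genomes. It also does not merge reverse complements. The function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k: int) -> Counter:
    """Count every k-mer across the sequences, skipping k-mers with N."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def frequency_histogram(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers having it.
    Peaks in this histogram reflect coverage depth and repeat content."""
    return Counter(counts.values())


counts = count_kmers(["ACGTACGT"], k=4)
hist = frequency_histogram(counts)
print(dict(counts))
print(dict(hist))
```

In a read set, k-mers seen only once or twice are mostly sequencing errors, while k-mers seen far above the main coverage peak come from repeats.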

18 Using Allpaths LG
You must have at least 2 libraries:
One overlapping fragment library, e.g. 100 bp reads with 180 bp spacing
One jumping mate-pair library, e.g bp spacing

19 How ALLPATHS-LG works
See Youtube:
[Figure labels: reads, corrected reads, doubled reads, unipaths, localized data, local graph assemblies, global graph assembly, assembly]
Oversimplified, actually fifty modules developed over the past six years and a bit cluttered. Laden with opportunities for improvement!
Sante Gnerre et al (2010) PNAS 1513–1518, doi: /pnas

20 Where is the sample data?
ALLPATHS-LG in DE
180 bp, 3500 bp
Data Source: GAGE Project

21 Where is the Allpaths LG App?
ALLPATHS-LG in DE

22 Fragment Reads ALLPATHS-LG in DE

23 Jumping Reads ALLPATHS-LG in DE

24 ALLPATHS-LG in DE Run Settings

25 Running ALLPATHS-LG

26 Post-QC: CEGMA
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.

27 Resources
iPlant
Assembly Competitions: Assemblathon, GAGE
Assembler Websites: ALLPATHS-LG, SOAPdenovo, Celera Assembler
Tools: FastQC, Tallymer, CEGMA

28 What Are Annotations?
Annotations are descriptions of features of the genome
Structural: exons, introns, UTRs, splice forms etc.
Coding & non-coding genes
Expression, repeats, transposons
Annotations should include evidence trail
Assists in quality control of genome annotations
Examples of evidence supporting a structural annotation:
Ab initio gene predictions
ESTs
Protein homology
It is especially important that every genome annotation carries an evidence trail describing in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations. Now while many of you are likely already familiar with this, I will explain anyway.

29 Secondary Annotation
Protein Domains: InterPro Scan combines many HMM databases
GO and other ontologies
Pathway mapping: e.g. BioCyc Pathway tools

30 Challenges in Plant Genome Annotation
Genomes are BIG
Highly repetitive
Many pseudogenes
Assembly contamination
Incomplete evidence
No method is 100% accurate

31 Options for Protein-coding Gene Annotation
Yandell & Ence. Nature Reviews Genetics 13, (May 2012) | doi: /nrg3174

32 Typical Annotation Pipeline
Contamination screening
Repeat/TE masking
Ab initio prediction
Evidence alignment (cDNA, EST, RNA-seq, protein)
Evidence-driven prediction
Chooser/combiner
Evaluation/filtering
Manual curation

33 MAKER-P Automated Pipeline
MPI-enabled to allow parallel operation on large compute clusters
Repeat Library
Ab initio prediction
Evidence
Collaboration with Yandell Lab

34 What is a GFF File? Generic Feature Format
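A GFF3 record is one tab-separated line with nine columns: seqid, source, type, start, end, score, strand, phase, and attributes. The attributes column holds semicolon-separated `key=value` pairs. The parser below is a minimal sketch. It does not decode URL-escaped characters or handle `##` directives beyond skipping them, and the example record is invented for illustration.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str   # "." when undefined
    strand: str  # "+", "-", "." or "?"
    phase: str   # "0", "1", "2" for CDS features, "." otherwise
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


# Hypothetical record, not taken from any real annotation set.
example = "chr1\tmaker\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=example"
feature = parse_gff3_line(example)
print(feature)
```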

35 Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
Figure 1. MAKER-P provides automated means for QC. MAKER-P provides methods for automatic management and quality control of genome annotations, using metrics developed by the Sequence Ontology project. One of these metrics is AED. AED is calculated in the same manner as SN and SP, but in place of a reference gene model, the coordinates of the union of the aligned evidence are used. AED = 1 – AC, where AC = (SN + SP)/2. An AED of 0 indicates that the annotation is in perfect agreement with its evidence, whereas an AED of 1 indicates a complete lack of evidence support for the annotation.
The left panel of this figure illustrates hypothetical cumulative AED distributions for 3 different annotated genomes. In a very well annotated genome, for example, 95% of the annotations have an AED of less than 0.5 (left panel). This is true, for example, of the human genome annotations.
In the current release of the Arabidopsis annotations (TAIR10), 88% of the annotations have an AED of less than 0.5 (navy line, right panel); thus the TAIR10 annotations are already quite good, but could be further improved. This value increases to 98% when only TAIR10’s 4- and 5-star rated transcripts are considered in the analysis (blue line, right panel). When all TAIR10 gene models are passed to MAKER-P and processed using its update functionality to automatically revise them to better fit the evidence, AEDs drop (green line vs. navy line), indicating improvements in quality. De novo annotation with MAKER produced an annotation set in which 97% of the annotations have an AED of less than 0.5 (red line).
[Figure axis: quality, from better to worse]
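The formula AED = 1 – AC with AC = (SN + SP)/2 can be worked through at the base level. This sketch assumes SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases supported by evidence. It also assumes 1-based inclusive coordinates. The helper names are hypothetical, and this is not MAKER-P's own implementation.

```python
def covered_positions(intervals):
    """Union of 1-based inclusive (start, end) intervals as a set of bases."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(annotation_exons, evidence_intervals) -> float:
    """Annotation Edit Distance: 0 = perfect agreement with the evidence,
    1 = no evidence support."""
    annotation = covered_positions(annotation_exons)
    evidence = covered_positions(evidence_intervals)
    if not annotation or not evidence:
        return 1.0
    shared = len(annotation & evidence)
    sn = shared / len(evidence)    # evidence bases captured by the annotation
    sp = shared / len(annotation)  # annotation bases backed by evidence
    return 1.0 - (sn + sp) / 2


print(aed([(1, 100)], [(1, 100)]))     # identical coordinates
print(aed([(1, 100)], [(51, 150)]))    # half of each overlaps the other
print(aed([(1, 100)], [(201, 300)]))   # no overlap at all
```

Building position sets is fine for single genes. A genome-scale version would merge sorted intervals instead.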

36 MAKER-P at iPlant
TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

Genome                Assembly   Size (Mb)  CPU   Run Time
Arabidopsis thaliana  TAIR10     120        600   2:44
Arabidopsis thaliana  TAIR10     120        1500  1:27
Zea mays              RefGen_v2  2067       2172  2:53

Campbell et al. Plant Physiology. December 4, 2013, DOI: /pp

PAG 2014:
W559 - Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn): 20.15 Gb assembly, split into 40 jobs, 216 CPU/job (8640 CPU total), 17 hours
P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species: 10 rice species (each w/ 12 chromosome pseudomolecules), 96 CPU per chromosome (1152 CPU total), ~2 hr per genome

37 MAKER-P at iPlant
Atmosphere: MAKER_2.28 (emi-F13821D0) Virtual image
MPI-enabled for parallel computing
Check out with up to 16 CPU
Tested with 4 CPU instance: completed rice chr 1 in 8 hr 45 min

38 MAKER-P Tutorial

39 Annotation Post-Analysis
AED threshold
InterProScan
Comparative analysis, e.g. BLAST vs RefSeq proteins

40 Annotation Post-Analysis
InterProScan

41 Assembly & Annotation at iPlant

42 Additional MAKER-P Resources
MAKER-P: lab.org/software/maker-p.html
Repeat Library construction: iki/index.php/Repeat_Library_Construction-- Advanced
Pseudogene identification: .php/Protocol:Pseudogene

