Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University
Motivation
Maize genome is more complex than previously sequenced genomes
– Many high-copy, long, highly conserved repeats
– Genome contains many NIPs (Nearly Identical Paralogs, low-copy genes that are expressed and >98% identical; Emrich et al., 2007) (= CNPs and CNV)
Hence, assembling this genome presents new challenges
Are existing assembly programs up to the task?
Evidence of Assembly Errors
Wash U noticed examples of collapse of repeats
ISU identified examples of NIP collapse
Terms
SNP: single nucleotide polymorphism between alleles of a single gene
Paramorphism (PM): a single nucleotide substitution between paralogs
Nearly Identical Paralogs (NIPs): paralogous sequences with >98% identity
[Figure labels: AC, AT, GC; B73, Mo17]
Paramorphisms Provide Evidence of NIPs
Frequency of NIPs
Conservatively, ~1% of maize genes have NIPs (Emrich et al., 2007)
Inspection of assembled BACs reveals NIP clusters
But we also detect examples of NIP collapse
CNPs/CNV are associated with adaptive evolution in humans (Perry et al., Nat. Genetics, 2007)
BAC Assembly, Example 1
MAGI3.1 ID: MAGI_18749 (Emrich et al., 2007)
BAC ID: CH C17
Paramorphic Sites: C/T (1,175), C/T (1,293), C/T (1,359)
CH C17: gi| |gb|AC (152,054 bp) GenBank 56,57255, bp
BAC Assembly, Example 1: Site #1
BAC ID: CH C17 GI: GB: AC ,054 bp
MAGI_18749 Paramorphic Site #1: C/T (1,175)
Reads at Paramorphic Site #1: 2 C vs 2 T
[Figure labels: Consensus Base, Paramorphic Site #1]
2 of 7 assembled BACs known to contain NIPs exhibit evidence of NIP collapse (a conservative estimate)
Traditional Assembly
Sequence alignments between reads are identified
Construct contigs
– Start at a good alignment
– Extend ends of contig one sequence at a time
Clone pair information is used to scaffold contigs after contig construction.
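The greedy procedure on this slide can be illustrated with a toy sketch. It is not the code of any particular assembler: the names `overlap` and `greedy_contig` and the `min_len` threshold are hypothetical, overlaps are exact suffix-prefix matches on one strand only, and clone pairs play no role during contig construction, which is the limitation the slide points out.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contig(reads, min_len=3):
    """Seed a contig at the best overlap, then extend its ends one read at a time."""
    unused = list(reads)
    if not unused:
        return ""
    # "Start at a good alignment": pick the ordered pair with the longest overlap.
    k, i, j = max(
        ((overlap(a, b, min_len), i, j)
         for i, a in enumerate(unused)
         for j, b in enumerate(unused) if i != j),
        default=(0, 0, 0),
    )
    if k == 0:
        return unused[0]
    contig = unused[i] + unused[j][k:]
    unused = [r for n, r in enumerate(unused) if n not in (i, j)]
    # "Extend ends of contig one sequence at a time".
    while unused:
        right = max(unused, key=lambda r: overlap(contig, r, min_len))
        left = max(unused, key=lambda r: overlap(r, contig, min_len))
        kr = overlap(contig, right, min_len)
        kl = overlap(left, contig, min_len)
        if kr == 0 and kl == 0:
            break  # no remaining read overlaps either end
        if kr >= kl:
            contig += right[kr:]
            unused.remove(right)
        else:
            contig = left[:len(left) - kl] + contig
            unused.remove(left)
    return contig


print(greedy_contig(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ATGGCGTACCTGA
```

Because each extension step looks only at the best local overlap, two near-identical repeat copies look the same to this procedure, which is one way a collapse can arise.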
Our Approach
Integrate clone pair data into contig assembly process
Model sequence alignments & clone pairs as a graph.
First, construct an alignment graph
Sequence reads are nodes
A black edge is drawn between a pair of nodes if there is a valid sequence alignment
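A minimal sketch of the alignment graph, under simplifying assumptions: reads are nodes, and a black edge joins two reads when a stand-in validity test passes. The slides do not define "valid sequence alignment", so the exact suffix-prefix test, the `min_len` threshold, and the names `has_valid_alignment` and `build_alignment_graph` are all hypothetical.

```python
from itertools import combinations


def has_valid_alignment(a: str, b: str, min_len: int = 4) -> bool:
    """Stand-in validity test (an assumption): exact suffix-prefix overlap in either order."""
    def ov(x, y):
        return any(x.endswith(y[:k]) for k in range(min_len, min(len(x), len(y)) + 1))
    return ov(a, b) or ov(b, a)


def build_alignment_graph(reads: dict, min_len: int = 4) -> dict:
    """Return an adjacency map: read name -> set of read names joined by a black edge."""
    graph = {name: set() for name in reads}
    for u, v in combinations(reads, 2):
        if has_valid_alignment(reads[u], reads[v], min_len):
            graph[u].add(v)
            graph[v].add(u)
    return graph


reads = {"r1": "ATGGCGT", "r2": "GCGTACC", "r3": "TACCTGA", "r4": "CCCCCCC"}
black = build_alignment_graph(reads)
# r1-r2 and r2-r3 are joined by black edges; r4 has no valid alignment and stays isolated.
```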
Clone Pair Informed Assembly
Second, introduce two additional types of edges into the graph
Clone pair edges (red)
Path edges (green)
A path edge exists between two nodes if:
they are close together in the graph AND
their clone pairs are also close together
Identifies assembly-relevant sequence alignments
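The path-edge rule can be sketched as follows. The slide does not say how "close together" is measured, so this example assumes it means "within `max_hops` black edges"; that threshold and the names `bfs_within` and `path_edges` are hypothetical. `mate` maps each read to the read at the other end of its clone (the red edge).

```python
from collections import deque
from itertools import combinations


def bfs_within(graph, start, max_hops):
    """Nodes reachable from start using at most max_hops black edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)


def path_edges(black, mate, max_hops=2):
    """Green edges: pairs that are close in the graph AND whose clone mates are also close."""
    near = {n: bfs_within(black, n, max_hops) for n in black}
    green = set()
    for u, v in combinations(sorted(black), 2):
        mu, mv = mate.get(u), mate.get(v)
        if mu is None or mv is None or mu == v:
            continue  # unpaired read, or u and v are mates of each other
        if v in near[u] and mv in near.get(mu, set()):
            green.add((u, v))
    return green


# Two overlapping chains (a1-a2-a3 and their mates b1-b2-b3) plus a repeat-induced
# alignment a2-x whose mate y lies elsewhere.
black = {
    "a1": {"a2"}, "a2": {"a1", "a3", "x"}, "a3": {"a2"},
    "b1": {"b2"}, "b2": {"b1", "b3"}, "b3": {"b2"},
    "x": {"a2"}, "y": set(),
}
mate = {"a1": "b1", "b1": "a1", "a2": "b2", "b2": "a2",
        "a3": "b3", "b3": "a3", "x": "y", "y": "x"}
green = path_edges(black, mate, max_hops=1)
# The a2-x alignment gets no green edge because the mates b2 and y are not close.
```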
Repeat Example
Our Approach
Series of graph transformations to ensure black edges (sequence alignments) represent correct genomic overlaps, and resolve entries into and exits out of repeats.
– Use clone pairs to validate alignments in repeat regions if the corresponding mate pairs are anchored to unique regions and exhibit alignment.
– Use paramorphisms to break spurious alignments due to NIPs.
– Use clone pairs to match entries into and exits out of repeats.
– Use clone pairs and validated alignments to guide contigs.
– Use graph min-cuts to find correct assignment of reads to the complementary strands.
– Use graph reductions and visualization for further analysis.
Example: Use Paramorphisms to Break Spurious Alignments
GTCT A CAG
GTCT C CAG
GTCT A CAG
GTCT C CAG
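The four reads on this slide can be used for a small illustration of the idea. This is a sketch under stated assumptions, not the authors' implementation: reads are taken to be already aligned, gap-free, and of equal length; a column counts as paramorphic when at least two different bases are each supported by `min_support` reads (so a lone sequencing error is not enough). The names `paramorphic_columns` and `break_spurious_edges` and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def paramorphic_columns(aligned: dict, min_support: int = 2) -> list:
    """Columns where two or more bases are each seen in at least min_support reads."""
    cols = []
    for i, column in enumerate(zip(*aligned.values())):
        counts = Counter(column)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            cols.append(i)
    return cols


def break_spurious_edges(edges, aligned, min_support=2):
    """Keep only black edges whose two reads agree at every paramorphic column."""
    cols = paramorphic_columns(aligned, min_support)
    return {(u, v) for u, v in edges
            if all(aligned[u][i] == aligned[v][i] for i in cols)}


aligned = {"r1": "GTCTACAG", "r2": "GTCTCCAG", "r3": "GTCTACAG", "r4": "GTCTCCAG"}
edges = set(combinations(sorted(aligned), 2))   # all six pairwise alignments
kept = break_spurious_edges(edges, aligned)
# Column 4 (A vs C) is paramorphic; only the A-A and C-C edges survive,
# so the reads split into one group per paralog.
```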
Three Random Stage 3 BACs
Shotgun sequences extracted from GenBank and trimmed
Name | Reads | Post Trim | Corrupt Quality Info
273D N H
273D22
Annotate paths via walking through the graph.
Make use of three levels of pointers:
– Black edges: show what steps are available
– Green edges: indicate the best path
– Red edges: indicate our final destination
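One way to read the three pointer levels is as a guided walk. The toy sketch below is only an illustration of that reading, not the authors' procedure (the slides list path-navigation algorithms as needed research). It assumes green edges are checked only between adjacent reads, and the names `walk`, `black`, `green`, `mate`, and `max_steps` are hypothetical.

```python
def walk(start, black, green, mate, max_steps=100):
    """Walk from start toward its clone mate, preferring green-supported steps."""
    target = mate[start]            # red edge: the final destination
    path = [start]
    visited = {start}
    while path[-1] != target and len(path) <= max_steps:
        here = path[-1]
        # Black edges: the steps that are available from here.
        options = [n for n in black[here] if n not in visited]
        if not options:
            break                   # dead end: target not reached
        # Green edges: prefer steps backed by a path edge; ties broken by name.
        options.sort(key=lambda n: (frozenset((here, n)) not in green, n))
        path.append(options[0])
        visited.add(options[0])
    return path


black = {"r1": {"r2"}, "r2": {"r1", "r3", "r0"}, "r3": {"r2", "r4"},
         "r4": {"r3"}, "r0": {"r2"}}
green = {frozenset(p) for p in [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]}
mate = {"r1": "r4"}
print(walk("r1", black, green, mate))   # ['r1', 'r2', 'r3', 'r4']
```

Without the green edges, the same walk takes the spurious branch at `r2` and stops at `r0` without reaching the mate, which is the situation the path edges are meant to avoid.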
273D22: Incorrect Contiging
[Figure labels: Contig 0, Contig 1]
Contig 1 is a small contig in the finished BAC that contains sequences that should be attached to the end of Contig 0.
273D22: Missing Scaffold
306N19: Mis-assembly
[Figure labels: Contig 3, Contig 5, Contig 0, Contig 4, Contig 3]
306N19: Complex Repeat
D396H10: Missed Scaffolding
[Figure labels: Contig 6, Contig 8, Contig 5]
D396H10: Missed Scaffolding
[Figure labels: Contig 7, Contig 2, Contig 3]
Identifying Assembly Errors ???
273D22: Weak Link not Corroborated by Clone Pairs
[Figure label: Contig 3]
Conclusions & Future Directions
Discovered misassembled regions in all three randomly chosen BACs
– Conclusions supported by multiple lines of evidence (clone pair + overlap)
– Mis-assemblies (e.g., repeat-induced knots; collapsed repeats & NIPs) and missed scaffolding
Benefits of our approach
– Can provide better assemblies: can navigate through repeats; can correctly assemble NIPs
– With development, could output contigs and perform scaffolding in one step
– Could provide refined finishing advice
– Could include a community-accessible visualization of assembled BAC contigs and supporting data (confidence levels)
Longer term
– Our assembly approach could be applied to whole genome assembly of maize and other complex genomes
– Could incorporate paired next generation sequencing data (e.g., 454, Solexa, SOLiD)
Needed research
– Random collection of finished BACs (truth)
– Develop algorithms for navigating paths through the graph
– Accurately construct final contigs that contain multiple copies of repeats
– Create BAC re-assembly pipeline (inform finishing efforts in future sequencing projects)
– Scale approach to whole genome level