Problems of Genome Assembly
James Yorke and Aleksey Zimin
University of Maryland, College Park
WGS sequencing
- Multiple copies of DNA
- Fragments of ,000 bases
- No information is retained on which part of the DNA the fragments came from.
WGS sequencing: fragments
- The sequencing machine reads bases on the ends of the fragments, producing pairs of reads.
- The fragment sizes are known only to within ±10-20%.
[Figure: a fragment with a pair of reads at its two ends, CAAGCTGAT... and …GTTTGGAAC, and unknown sequence between them]
The mathematical problem
- We start with millions of pairs of reads, bases each
- Multiple copies of DNA provide multiple coverage by reads
- The problem of genome assembly is to recover the original sequence of bases of the genome (as much as possible…).
Assembling a jigsaw puzzle 1
- The task of assembly becomes the task of assembling a giant jigsaw puzzle
- We look for reads whose sequences suggest that they came from the same place in the genome:
    AGTGATTAGATGATAGTAGA
            ||||||||||||
            GATGATAGTAGAGGATAGATTTA
Assembling a jigsaw puzzle 2
- Then we put "overlapping" reads together. The reads:
    AGTGATTAGATGATAGTAGA
           AGATGATAGTAGAGATAGATAGACC
                         ATAGATAGACCACTCATCATAC
- This yields a "contig":
    AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC
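The overlap-and-merge step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names `overlap_length` and `merge_reads` and the `min_len` threshold are assumptions made for the example, and the reads are assumed to be error-free and already in left-to-right order.

```python
def overlap_length(left: str, right: str, min_len: int = 8) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(reads: list[str], min_len: int = 8) -> str:
    """Chain reads that are already in left-to-right order into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_len)
        if k == 0:
            raise ValueError("no sufficient overlap; reads do not form one contig")
        contig += read[k:]  # append only the part of the read not yet covered
    return contig


# The three reads shown on the slide.
reads = [
    "AGTGATTAGATGATAGTAGA",
    "AGATGATAGTAGAGATAGATAGACC",
    "ATAGATAGACCACTCATCATAC",
]
contig = merge_reads(reads)
print(contig)  # AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC
```

A real assembler must also discover the order of the reads, consider reverse complements, and tolerate sequencing errors; the sketch leaves all of that out.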
Assembling a jigsaw puzzle 3
- We use read pairing information to order and orient contigs to produce scaffolds, the final product of assembly
[Figure: pairs of reads belonging to the same fragment of DNA, placed on contigs]
Difficulties in assembly
- Sequencing errors: two reads that came from the same place in the genome often have mismatching sequences
    AGTGATTAGATCATAGTAGAG
             || |||||||||
             ATGATAGTAGAGGATAGAT
- Repetitive DNA (~ 5-20% of human DNA is repetitive):
    TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
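To cope with sequencing errors, overlap detection has to accept a few mismatches. The sketch below shows one simple way to do that, by counting mismatched positions in each candidate overlap. The function name `tolerant_overlap` and the `min_len` and `max_mismatches` parameters are hypothetical, chosen only for this illustration; production assemblers use more elaborate alignment and error models.

```python
def tolerant_overlap(left: str, right: str, min_len: int = 8,
                     max_mismatches: int = 1) -> tuple[int, int]:
    """Longest suffix/prefix overlap with at most `max_mismatches` substitutions.

    Returns (overlap_length, mismatch_count), or (0, 0) if none qualifies.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        # Compare the last k bases of `left` with the first k bases of `right`.
        mismatches = sum(a != b for a, b in zip(left[-k:], right[:k]))
        if mismatches <= max_mismatches:
            return k, mismatches
    return 0, 0


# The two reads from the slide differ by one base inside their overlap.
read_a = "AGTGATTAGATCATAGTAGAG"
read_b = "ATGATAGTAGAGGATAGAT"
print(tolerant_overlap(read_a, read_b))                    # (12, 1)
print(tolerant_overlap(read_a, read_b, max_mismatches=0))  # (0, 0)
```

The trade-off is that every allowed mismatch also makes it easier for near-identical repeat copies to look like true overlaps, which ties the two difficulties on this slide together.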
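Repeat regions may cause omissions
[Figure: a genome region A R B R C, with two copies of the repeat R, is assembled as A R C; the sequence B between the repeat copies is omitted]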
Erroneous duplications
- Two recently published assemblies of the cow genome: UMD2 and BosTau4
- Segmental duplications were a central theme in the BosTau4 genome paper
- The UMD2 assembly had many fewer duplications
- We examined the duplications with > 99.5% identity and > 5000 bp that have one copy in the UMD2 assembly and two copies in BosTau4
- Each base in the genome is covered by 6 reads, on average. A way to judge which assembly is correct is to compute the average read coverage for these regions.
[Figure: UMD2 and BosTau4]
Examining read coverage reveals errors
- In the figure, the thick solid vertical line is placed at the coverage at which a region is as likely to have two copies as it is to have one.
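One way such a threshold could be located is sketched below. It rests on assumptions that are not stated on the slides: that the number of reads covering a base is Poisson distributed with mean \lambda per genomic copy, that a region with two true copies placed on a single assembled copy therefore has mean 2\lambda, and that one and two copies are equally likely a priori. Equating the two likelihoods for an observed coverage k gives:

```latex
\[
\frac{e^{-\lambda}\lambda^{k}}{k!} \;=\; \frac{e^{-2\lambda}(2\lambda)^{k}}{k!}
\;\Longrightarrow\; e^{\lambda} = 2^{k}
\;\Longrightarrow\; k^{*} = \frac{\lambda}{\ln 2},
\qquad
\lambda = 6 \;\Rightarrow\; k^{*} \approx 8.7 .
\]
```

Under this model, regions with average coverage well below about 8.7 are better explained by one copy and regions well above it by two. The model actually used for the figure may differ.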
Next Gen vs. Sanger Sequencing
- Sanger sequencing for a mammalian (~ 3Gbp) genome
  - Expensive: $50M for a mammalian genome
  - Large amount of DNA required
  - We get bp reads all with mate pairs
- Illumina and 454 sequencing for the same genome
  - Inexpensive: as low as $25K (Illumina), or $1M (454) for a mammalian genome
  - Small amount of DNA required (e.g. one insect)
  - Only 100 or 400 bp reads, some with mate pairs
- Assembly is a much harder problem now
Difficulties in de novo assembly of Illumina and 454 data
- Reads are short: high coverage is needed, imposing demanding requirements on the software and computer hardware
- Error patterns in the reads:
  - substitution errors in Illumina reads
  - homopolymer errors (unable to tell AAAA from AAA) in 454 reads
- Biased coverage by Illumina reads depending on the GC content
- Unreliable mate pairs:
  [Figure: a mate pair as reported and what it "could actually be"]
- Assembly techniques have a much larger impact now
NGS Assemblers
- New assemblers developed for different kinds of NGS data:
  - Newbler for 454 data
  - SOAPdenovo, Velvet, ABySS, ALLPATHS, and others for Illumina data
- We use the open-source Celera Assembler (CA), currently supported by the J. Craig Venter Institute bioinformatics team
- CA is capable of assembling mixed data sets
Assembly quality varies significantly with the software used
- Example 1: Argentine ant assembly comparison.
- Both assemblies used the same 75bp Illumina reads, unmated and in 3kb and 8kb mate pairs

                        SOAPdenovo   CA 5.4       Improvement
  Sequence in assembly  137 Mbp      171 Mbp      25%
  N50 Scaffold size     139 bp       386,149 bp   3000 times
  N50 Contig size       139 bp       3,367 bp     24 times
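The N50 figures in the table follow the usual definition: the length L such that contigs (or scaffolds) of length at least L contain at least half of all assembled bases. A minimal Python sketch of that computation follows. The function name `n50` and the sample lengths are made up for illustration and are not data from either assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


# Hypothetical contig lengths: total is 32, and the two 8-base contigs
# already hold 16 bases, i.e. half of the total.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

Because N50 is weighted by bases rather than by the number of sequences, a few long scaffolds raise it sharply, which is why it can change by orders of magnitude between assemblers while the total assembled sequence changes only modestly.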
Assembly quality varies significantly with the software used
- Example 2: Pogonomyrmex barbatus, the Red Harvester Ant, assembly comparison (454 data).
- Both assemblies used the same 454 data in 3kb mate pairs, 8kb mate pairs and shotgun reads

                        Newbler      CA 5.3       Improvement
  Sequence in assembly  194 Mbp      220 Mbp      13%
  N50 Scaffold size     47 Kbp       794 Kbp      17 times
  N50 Contig size       2 Kbp        12 Kbp       6 times
Benefits of combining 454 and Illumina data
- Example 3: Argentine ant assembly comparison, with Illumina data and 454 data assembled with Celera Assembler 5.4.
- 45x Illumina coverage, 15x 454 coverage
- Unmated reads, 3kb and 8kb mate pairs
[Table: columns Illumina only, 454 only, Illumina and 454; rows Sequence in assembly (Mbp), N50 Scaffold size (Kbp) ,459, N50 Contig size (Kbp)]
Post-assembly steps
- Assemblers output scaffolds: ordered and oriented collections of contigs.
- Scaffolds typically are much smaller than chromosomes and may contain large-scale errors.
- Some mate pair linking information remains unused by assemblers.
- Marker maps, i.e. collections of short sequences whose positions on the chromosomes are known, can be used to position the contigs on the chromosomes.
UMD Chromosome builder
- Uses contigs, mate pairs and markers, discarding unreliable scaffold information
- Mapping steps:
  - Use mate pairs to orient contigs
  - Use markers and mate pairs to assign oriented contigs to the chromosomes
  - Compute the position of each contig on the chromosome as the best least-squares fit to the available mate pair and marker data
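The last mapping step can be illustrated with a toy least-squares fit. This is a sketch under stated assumptions, not the UMD Chromosome builder's code: it treats each marker as a direct estimate of a contig's position, treats each mate-pair link as an estimate of the distance between two contigs, weights all of them equally, and minimizes the sum of squared residuals by Gauss-Seidel sweeps. The function name `fit_positions` and all input values are hypothetical.

```python
def fit_positions(contigs, markers, links, sweeps=500):
    """Least-squares contig positions on one chromosome.

    markers: (contig, position) pairs.
    links:   (left, right, distance) triples, meaning
             position[right] - position[left] should be close to distance.
    Every connected group of contigs needs at least one marker; otherwise
    its absolute position is not determined.
    """
    pos = {c: 0.0 for c in contigs}
    for _ in range(sweeps):
        for c in contigs:
            # Each observation proposes a position for c given the others.
            targets = [p for name, p in markers if name == c]
            targets += [pos[r] - d for l, r, d in links if l == c]
            targets += [pos[l] + d for l, r, d in links if r == c]
            if targets:
                # The mean minimizes the squared residuals that involve c.
                pos[c] = sum(targets) / len(targets)
    return pos


# Hypothetical data: the markers put A at 0 and C at 2600, while the
# mate-pair distances (1000 and 1500) add up to only 2500.
positions = fit_positions(
    contigs=["A", "B", "C"],
    markers=[("A", 0.0), ("C", 2600.0)],
    links=[("A", "B", 1000.0), ("B", "C", 1500.0)],
)
print({name: round(p, 3) for name, p in positions.items()})
```

In this example the 100-base disagreement is spread evenly over the four observations, giving positions close to A = 25, B = 1050, C = 2575. A realistic fit would weight observations by their reliability and solve the sparse linear system directly.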
Computing contig orientations
- An orientation problem:
[Figure: three contigs A, B and C]
Computing contig orientations
- An orientation problem:
[Figure: three contigs A, B and C]
- Matrix form: M is a matrix whose rows and columns are indexed by the contigs A, B, C
[Matrix M: entries not recovered]
- Compute y, the eigenvector corresponding to the largest eigenvalue of M. The signs of the eigenvector components give the recipe for flipping contigs to achieve consistent orientations
Computing contig orientations
- The eigenvector of M corresponding to the largest eigenvalue, or Perron-Frobenius eigenvalue = 2: y=(0.5774, , ).
- sign(y) = (1, 1, -1), that is, the solution is to flip contig C
- Final matrix of orientations = diag(sign(y))*M*diag(sign(y)):
[Matrix: entries not recovered]
- Flipping C is the correct solution!
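The eigenvector recipe can be tried on a small example. The matrix entries did not survive in this transcript, so the matrix `M` below is an assumption: it uses +1 where mate pairs would say two contigs have the same orientation, -1 where they would say opposite, and 0 on the diagonal, chosen so that it reproduces the stated largest eigenvalue of 2 and sign(y) = (1, 1, -1). The helper `leading_eigenpair` is a hypothetical name. It uses plain power iteration on a shifted matrix, which is one of several ways to get the leading eigenvector of a symmetric matrix.

```python
import math


def leading_eigenpair(m, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric, nonzero matrix.

    Power iteration is run on m + shift*I, where the shift (a Gershgorin
    bound) makes all eigenvalues non-negative so that the largest
    eigenvalue of m is also the dominant one.
    """
    n = len(m)
    shift = max(sum(abs(v) for v in row) for row in m)
    y = [1.0 + i / n for i in range(n)]  # arbitrary start vector
    for _ in range(iterations):
        z = [sum(m[i][j] * y[j] for j in range(n)) + shift * y[i]
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in z))
        y = [v / norm for v in z]
    if y[0] < 0:  # eigenvectors are defined up to sign; fix the first entry
        y = [-v for v in y]
    eigenvalue = sum(y[i] * m[i][j] * y[j] for i in range(n) for j in range(n))
    return eigenvalue, y


# Assumed orientation matrix for contigs A, B, C (not the slide's own values).
M = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]

eigenvalue, y = leading_eigenpair(M)
signs = [1 if v >= 0 else -1 for v in y]
# diag(sign(y)) * M * diag(sign(y)): the orientations after flipping.
oriented = [[signs[i] * M[i][j] * signs[j] for j in range(3)] for i in range(3)]

print(round(eigenvalue, 4), [round(v, 4) for v in y])
print(signs)     # [1, 1, -1] -> flip contig C
print(oriented)  # no negative entries remain
```

For this assumed matrix the largest eigenvalue is 2, its eigenvector has components of equal magnitude 0.5774, and after flipping C every remaining link is consistent. With conflicting links in real data, the signs give a best compromise rather than an exact solution.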
Conclusions
- Genome assembly is a difficult problem that has gotten harder because of Next Gen Sequencing data
- Assembly techniques have a large impact on the quality of the assembly
- Output of the assembler is not the final assembly; extensive post-processing is required to produce chromosome sequences