Improving the Accuracy of Genome Assemblies
July 17th, 2012
Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz2 and Pavel Pevzner1
1. University of California, San Diego
2. Wayne State University, Michigan
* Contributed equally to this work
≈ $ billions, ≈ several years, ≈ hundreds of people
≈ $ thousands, ≈ several weeks, ≈ two people
High Throughput Sequencing Assemblies
Draft Genome from HTS
Sample Preparation (Fragments) → Sequencing (Reads) → Assembly (Contigs) → Analysis
HTS assemblies (contigs) still contain an abundance of errors:
subst. errors per 100 kbp with SOAPdenovo
subst. errors per 100 kbp with Velvet
Small (<50 bp) INDEL errors
Misassemblies, large INDELs, etc.
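The error rates on this slide are normalized per 100 kbp. A minimal sketch of that normalization follows; the function name and the example counts are hypothetical and are not figures from the talk.

```python
def errors_per_100kbp(num_errors, aligned_bases):
    """Normalize an error count to errors per 100,000 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return num_errors * 100_000 / aligned_bases

# Hypothetical counts, for illustration only (not results from the talk):
# 46 substitution errors over 4.6 Mbp of contig aligned to a reference.
rate = errors_per_100kbp(46, 4_600_000)
```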
Errors in the assembled contigs will profoundly affect any downstream analysis.
Sample Preparation (Fragments) → Sequencing (Reads) → Assembly (Contigs) → SEQuel (Refined Contigs) → Analysis
De Bruijn Graph for Fragment Assembly
De Bruijn Graph (Pevzner, Tang, Waterman 2001)
[Figure, shown over several frames: the k-mer paths GCC CCA CAT ATT TTA; GCC CCT CTT TTT TTA; CCT CTA TAT ATT, with identical k-mers merged step by step into single nodes.]
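The gluing of identical k-mers shown in these frames can be sketched in a few lines. This is a minimal illustration, not the authors' code: the three reads are hypothetical strings chosen so that their 3-mers match those on the slides, and nodes are taken to be k-mers with an edge between k-mers that are adjacent in a read.

```python
from collections import defaultdict

def kmers(read, k):
    """Return the consecutive k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Node = k-mer; edge = two k-mers adjacent in some read.

    Keying the dict by the k-mer string is what glues identical
    k-mers from different reads into a single node.
    """
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
        graph.setdefault(ks[-1], set())  # keep the last k-mer as a node
    return dict(graph)

# Hypothetical reads whose 3-mers reproduce the k-mers on the slides.
reads = ["GCCATTA", "GCCTTTA", "GCCTATT"]
g = de_bruijn(reads, 3)

for node in sorted(g):
    print(node, "->", sorted(g[node]))
```

The three reads contain 15 k-mer occurrences but only 10 distinct k-mers, so the graph has 10 nodes; GCC and CCT each have two outgoing edges, which is where the paths branch.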
Challenges
Sequencing errors cause bulges in the de Bruijn graph
[Figure, shown over several frames: the reads GCCTAGGAC and CACTTGGCA give the k-mer paths GCC CCT CTA TAG AGG GGA GAC and CAC ACT CTT TTG TGG GGC GCA. Adding the read GCCTTGGAC, which differs from GCCTAGGAC by one base, links the two paths through the edges CCT → CTT and TGG → GGA.]
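The effect of the single erroneous read can be checked directly by comparing the edge sets with and without it. This is a small sketch using the reads from the slide; the choice of k = 3 and the helper name `edges` are assumptions made for illustration.

```python
def edges(reads, k):
    """Set of directed edges (k-mer, next k-mer) over all reads."""
    out = set()
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        out.update(zip(ks, ks[1:]))
    return out

clean = ["GCCTAGGAC", "CACTTGGCA"]
noisy = clean + ["GCCTTGGAC"]  # differs from GCCTAGGAC at the 5th base (A -> T)

# Edges that exist only because of the erroneous read.
new_edges = edges(noisy, 3) - edges(clean, 3)

# Nodes that now have more than one outgoing edge (branch points).
out_degree = {}
for a, _ in edges(noisy, 3):
    out_degree[a] = out_degree.get(a, 0) + 1
branch_nodes = {n for n, d in out_degree.items() if d > 1}

print(sorted(new_edges))
print(sorted(branch_nodes))
```

One substitution adds two edges, CCT → CTT and TGG → GGA, so CCT and TGG become branch points and there are now two alternative paths between CCT and GGA.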
The SEQuel Algorithm
The SEQuel Algorithm
Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
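The definition above amounts to a simple predicate over the two mates of a pair. The sketch below assumes each mate carries a count of reported alignment locations; the `Mate` type and its `num_hits` field are hypothetical, and how uniqueness is actually decided from aligner output is not stated on the slide.

```python
from typing import NamedTuple

class Mate(NamedTuple):
    name: str
    num_hits: int  # hypothetical: number of locations the aligner reported

def aligned_uniquely(mate: Mate) -> bool:
    """A mate is uniquely aligned if it has exactly one reported location."""
    return mate.num_hits == 1

def is_permissively_aligned(mate1: Mate, mate2: Mate) -> bool:
    """True if at least one mate of the pair aligned uniquely."""
    return aligned_uniquely(mate1) or aligned_uniquely(mate2)

# Illustrative pairs: unique/unique, unique/repeat, repeat/unaligned.
pairs = [
    (Mate("p1/1", 1), Mate("p1/2", 1)),
    (Mate("p2/1", 1), Mate("p2/2", 7)),
    (Mate("p3/1", 4), Mate("p3/2", 0)),
]
kept = [m1.name.split("/")[0] for m1, m2 in pairs if is_permissively_aligned(m1, m2)]
```

Under this reading, the second pair is kept even though one mate falls in a repeat, because its partner anchors it to a single location.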
Positional De Bruijn Graph
Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111).
[Figure, shown over several frames: the positional k-mer paths GCC,111 CCA,112 CAT,113 ATT,114 TTA,115; CCT,112 CTT,113 TTT,114 TTA,115; GCC,975 CCT,976 CTA,977 TAT,978 ATT,979.]
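A positional de Bruijn graph can be sketched by making each node a (k-mer, position) pair instead of a bare k-mer. The code below is an illustration under stated assumptions: the aligned reads are hypothetical strings chosen to reproduce the positional k-mers on the slide, and the gluing rule (same k-mer, positions within `delta` of each other, with `delta = 0` meaning exact match) is an assumed simplification rather than a rule taken from the slides.

```python
from collections import defaultdict

def positional_kmers(read, start, k):
    """(k-mer, position) pairs for a read whose first base aligns at `start`."""
    return [(read[i:i + k], start + i) for i in range(len(read) - k + 1)]

def positional_de_bruijn(aligned_reads, k, delta=0):
    # First pass: collect every position at which each k-mer was observed.
    seen = defaultdict(set)
    for read, start in aligned_reads:
        for kmer, pos in positional_kmers(read, start, k):
            seen[kmer].add(pos)

    # Glue occurrences of the same k-mer whose position is within `delta`
    # of the previous occurrence; a cluster is named by its smallest position.
    node_of = {}
    for kmer, positions in seen.items():
        anchor = prev = None
        for pos in sorted(positions):
            if prev is None or pos - prev > delta:
                anchor = pos
            node_of[(kmer, pos)] = (kmer, anchor)
            prev = pos

    # Second pass: connect consecutive positional k-mers of each read.
    graph = defaultdict(set)
    for read, start in aligned_reads:
        nodes = [node_of[pk] for pk in positional_kmers(read, start, k)]
        for a, b in zip(nodes, nodes[1:]):
            graph[a].add(b)
        if nodes:
            graph.setdefault(nodes[-1], set())
    return dict(graph)

# Hypothetical (read, alignment start) pairs matching the slide's k-mers.
aligned = [("GCCATTA", 111), ("CCTTTA", 112), ("GCCTATT", 975)]
pg = positional_de_bruijn(aligned, 3)
```

With positions attached, (GCC, 111) and (GCC, 975) stay separate nodes, as do (ATT, 114) and (ATT, 979), while the two occurrences of (TTA, 115) are glued. In a plain de Bruijn graph the same k-mers would collapse and tangle the two regions together.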
partial contig #1: GCCATTA
partial contig #2: GCCTATT
The SEQuel Algorithm
Original contig:
GTATTCCGAGGACCACTGGATTATGA
Partial contigs aligned to the original contig:
GTATTCCGAGGACCAC---TGGATTATGA
GCGGGCCGAGGA   CAAATGGATTACGA
After applying the first partial contig:
GCGGGCCGAGGACCAC---TGGATTATGA
After applying the second partial contig:
GCGGGCCGAGGACCACAAATGGATTACGA
Repeat for all contigs.
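The final splice step in this example can be sketched as replacing stretches of the original contig with the partial contigs. This is only an illustration of that last step: the `refine` helper and the 0-based coordinates are assumptions inferred from the alignment on the slides, and how SEQuel derives the partial contigs and their positions is not shown here.

```python
def refine(contig, patches):
    """Replace contig[start:end] with seq for each (start, end, seq) patch.

    Patches must not overlap. They are applied right to left so that the
    coordinates of patches further left remain valid.
    """
    refined = contig
    prev_start = len(contig)
    for start, end, seq in sorted(patches, reverse=True):
        if not (0 <= start <= end <= prev_start):
            raise ValueError("patches overlap or fall outside the contig")
        refined = refined[:start] + seq + refined[end:]
        prev_start = start
    return refined

original = "GTATTCCGAGGACCACTGGATTATGA"

# Coordinates inferred from the slide's alignment (assumption):
# the first partial contig covers the first 12 bases, the second covers
# the last 11 bases and carries a 3-base insertion (AAA).
patches = [
    (0, 12, "GCGGGCCGAGGA"),
    (15, 26, "CAAATGGATTACGA"),
]

refined = refine(original, patches)
print(refined)
```

The result is three bases longer than the original contig because the second partial contig fills the gap with AAA, in addition to correcting substitutions at both ends.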
Results
Standard and single-cell E. coli.
100 bp paired-end Illumina (GAII) reads.
Mean coverage ≈ 600x.
Assemblies compared to the reference with and without SEQuel.
Standard E. coli
Single Cell Sequencing (Chitsaz et al., 2011)
[Figure panels: Standard; Single Cell]
Single Cell E. coli
Summary
Removed 35% to 96% of small-scale assembly errors.
Introduced the positional de Bruijn graph for contig refinement.
Demonstrated utility in hard (single-cell) assembly.
SEQuel can be used in combination with any assembler.
Freely available at:
Acknowledgments
3P41RR S1, CCF