Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye
  ◦ Sequencing Methods
  ◦ Experimental comparison of De Bruijn graph and Overlap graph assemblers
  ◦ Preliminary Results
  ◦ Lab Exercise
Sequencing Methods
  Sanger Sequencing
    ◦ Cycle sequencing reaction
    ◦ ddNTP-terminated, dye-labeled products
    ◦ High-resolution electrophoretic separation
    ◦ Parallelized in 96 or 384 capillaries
    ◦ Read lengths up to 1 kbp
    ◦ Raw accuracy up to %
    ◦ Costs 50¢ per kb
Sequencing Methods
  Second Gen. Sequencing
    ◦ Cyclic array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
    ◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow
Illumina
Illumina continued
AB SOLiD
454 Pyrosequencing
  Create a DNA library
    ◦ Ligate adaptors to fragments
  Emulsion PCR
    ◦ Agarose beads
    ◦ Oil, water, PCR reagents
    ◦ Results in about 1 million copies of the fragment on each bead
More 454
  Beads arrayed into a picotiter plate
    ◦ Immobilized by adding enzyme-containing beads
    ◦ Each well contains exactly 1 bead
  Enzymes used: Bst polymerase, luciferase, apyrase, ATP sulfurylase
Even more 454
  Example of output
    ◦ Flow order: TACGTACG
    ◦ Signal levels: 1-mer, 2-mer, 3-mer, 4-mer
    ◦ Key: TCAG
  Measures the presence or absence of each nucleotide at any given position
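To make the flow-order idea concrete, here is a minimal sketch of how per-flow signals could be turned into bases. It assumes idealized, already normalized signals where a value near 1 means one base, near 2 a 2-mer, and so on. The function name `decode_flowgram` and the example signal values are made up for illustration and are not taken from the 454 software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow signal intensities into a base string.

    signals[i] is the normalized light signal of flow i; the nucleotide
    flowed at step i is flow_order[i % len(flow_order)].
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest integer; 0 means the nucleotide was not incorporated.
        run_length = int(signal + 0.5)
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Ideal signals for the key: T yes, A no, C yes, G no, T no, A yes, C no, G yes.
key_read = decode_flowgram([1, 0, 1, 0, 0, 1, 0, 1])

# Made-up noisy signals over the same eight flows (TACGTACG).
noisy_read = decode_flowgram([1.02, 0.05, 0.97, 2.10, 0.03, 1.95, 0.00, 1.01])
```

With the flow order TACGTACG, the ideal signals decode to the key TCAG. Real base callers also have to handle signal decay and ambiguous values between two integers, which this sketch ignores.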
Videos (454 Workflow)
Videos (Pyrosequencing) note: we did not choose the music
Comparison of 2nd Gen Platforms
De Bruijn Graph assemblers and Overlap Graph assemblers
  De Bruijn Graph assemblers
    ◦ Velvet, ABySS, Euler
  Overlap Graph assemblers
    ◦ Newbler, Edena, SSAKE, VCAKE
Synthetic Data used for Experiments
  Write a C program to simulate reads from a reference genome with a specified read length, coverage and base error rate
    ◦ Human chr 22, ~33.5M bases
    ◦ Streptococcus suis, NC_ , ~2M bases
    ◦ Helicobacter acinonychis Sheeba, ~1.5M bases
  Write another C program to measure the quality of the assemblers' output
    ◦ N50 length
    ◦ No. of contigs
    ◦ Max contig length
    ◦ No. of mis-assembled contigs
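The two programs described here were written in C and are not shown in the slides. As a rough Python sketch of the same ideas, the block below samples uniformly placed reads with substitution errors and computes N50 from a list of contig lengths. The function names, the uniform placement, the substitution-only error model and the fixed seed are all assumptions made for illustration.

```python
import random


def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample uniformly placed reads and apply substitution errors."""
    rng = random.Random(seed)
    # Number of reads needed so that reads * read_length ~ coverage * genome length.
    n_reads = int(coverage * len(genome) / read_length)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig to the shortest, first reaches half of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


toy_genome = "ACGT" * 250  # 1000 bases, a stand-in for a real reference
toy_reads = simulate_reads(toy_genome, read_length=50, coverage=20, error_rate=0.001)
```

For the toy genome, 20X coverage with 50-base reads gives 20 * 1000 / 50 = 400 reads. Counting mis-assembled contigs needs an alignment back to the reference and is left out of the sketch.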
Read Length
  De Bruijn graph assemblers are only suitable for short-read data
  K limitation
    ◦ Use a hash table or sorting to index K-mers
    ◦ Need a unique key (value) to represent each K-mer
    ◦ K <= 16: 32-bit integer (unsigned int)
    ◦ K <= 32: 64-bit integer (unsigned long long)
    ◦ K > 32? Multiple integers are needed to represent the hash table key
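The K limitation follows from packing each base into 2 bits. The sketch below shows one common way to build such a key; the mapping A=0, C=1, G=2, T=3 and the function names are assumptions for illustration, not the encoding of any particular assembler.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a K-mer into one integer, 2 bits per base (first base in the highest bits)."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key


def decode_kmer(key, k):
    """Inverse of encode_kmer for a known K."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[key & 3])  # lowest 2 bits hold the last base
        key >>= 2
    return "".join(reversed(bases))


# The largest possible keys for K = 16 and K = 32 (all bases T).
largest_k16 = encode_kmer("T" * 16)
largest_k32 = encode_kmer("T" * 32)
```

The largest key for K = 16 is 2^32 - 1 and for K = 32 it is 2^64 - 1, so those are the longest K-mers that fit a 32-bit and a 64-bit unsigned integer. Python integers grow as needed, so the sketch does not overflow for K > 32, whereas a C implementation would need several integers per key.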
Simulated reads from Streptococcus suis: read length 300, 50X coverage, error rate 0.1%
  Velvet default: K <= 31, so we use K = 31
    Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
    Velvet     46515 ( bp)                  115 bp      5 (1346 bp)
  Recompile Velvet, K = 99
    Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
    Velvet     441 ( bp)                    15328 bp    1 (34 bp)
Quality and Accuracy
  Some of the literature states that “De Bruijn based approach prone to false positives” and “Overlap graph has better quality”
Simulated reads from Helicobacter acinonychis Sheeba: read length 35, 50X coverage, error rate 0.1%
  Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
  Velvet     336 ( bp)                    10.4 kbp    17 ( bp)
  Edena      340 ( bp)                    9.8 kbp     0 (0 bp)
Simulated reads from Streptococcus suis: read length 35, 50X coverage, error rate 0.1%
  Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
  Velvet     1106 ( bp)                   5266 bp     12 ( bp)
  Edena      1003 ( bp)                   6416 bp     0 (0 bp)
Runtime and Memory Usage
  Overlap graph based assemblers are computationally expensive and use more memory
    ◦ All-to-all alignment of reads, O(n^2)
    ◦ More memory is needed to store the overlap graph
  Typically, the number of reads is far larger than the number of K-mers
    ◦ Especially for short-read data
    ◦ With the same coverage and genome length, shorter reads mean more reads
  It is stated that De Bruijn graphs are more suitable for NGS data
    ◦ Shorter reads and high throughput
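The read-count argument is simple arithmetic: for coverage C, genome length G and read length L, the number of reads is about C * G / L, while the number of distinct error-free K-mers in a linear genome is at most G - K + 1. The sketch below works this through for a ~2M base genome at 50X. The function names are made up, and the K-mer bound deliberately ignores the extra K-mers created by sequencing errors.

```python
def number_of_reads(genome_length, coverage, read_length):
    """Reads needed for the requested coverage: N = C * G / L (rounded down)."""
    return coverage * genome_length // read_length


def max_genomic_kmers(genome_length, k):
    """Upper bound on distinct error-free K-mers in a linear genome."""
    return genome_length - k + 1


genome_length = 2_000_000  # ~2M bases, the size quoted for Streptococcus suis
reads_35 = number_of_reads(genome_length, coverage=50, read_length=35)
reads_300 = number_of_reads(genome_length, coverage=50, read_length=300)
kmers_31 = max_genomic_kmers(genome_length, k=31)
```

Under these assumptions, 35-base reads give about 2.86 million reads against at most about 2.0 million genomic 31-mers, while 300-base reads give only about 0.33 million reads. The gap between reads and K-mers therefore shows up mainly for short reads.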
Simulated reads from Streptococcus suis: read length 50, 20X coverage, error rate 0.1%
  Assembler  Time      Memory
  Velvet     33 secs   ~220 M
  SSAKE      26 mins   ~900 M
  VCAKE      107 mins  ~1.1 G
  Machine: Xeon E GHz
However!
  Recent advances in pattern matching algorithms and techniques enable the use of overlap graphs
    ◦ Suffix tree, suffix array, prefix array, compressed suffix array
  Suffix array
    ◦ Can find overlaps between reads in linear time
    ◦ Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
  Examples
    ◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18.
    ◦ Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics (2010) 26(12): i367-i373.
    ◦ Pasqual: Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
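To show the basic idea of finding overlaps through sorted suffixes instead of all-to-all alignment, here is a deliberately naive sketch. It stores every suffix as a plain string in one sorted list and uses binary search to match read prefixes against it. This is not the linear-time or compressed construction used by Edena, the FM-index method or Pasqual; the function names and the exact-match overlap definition are assumptions for illustration.

```python
from bisect import bisect_left


def build_suffix_index(reads, min_overlap):
    """Sorted list of (suffix, read_id) for every proper suffix of length >= min_overlap."""
    entries = []
    for read_id, read in enumerate(reads):
        for start in range(1, len(read) - min_overlap + 1):
            entries.append((read[start:], read_id))
    entries.sort()
    return entries


def find_overlaps(reads, min_overlap):
    """Return sorted (a, b, length) triples where a suffix of reads[a]
    exactly equals a proper prefix of reads[b]."""
    index = build_suffix_index(reads, min_overlap)
    overlaps = []
    for b, read in enumerate(reads):
        for length in range(min_overlap, len(read)):
            prefix = read[:length]
            # First position whose suffix is >= prefix; -1 sorts before any read id.
            pos = bisect_left(index, (prefix, -1))
            while pos < len(index) and index[pos][0] == prefix:
                a = index[pos][1]
                if a != b:
                    overlaps.append((a, b, length))
                pos += 1
    return sorted(overlaps)


toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_overlaps = find_overlaps(toy_reads, min_overlap=3)
```

For the three toy reads, read 0 overlaps read 1 by "GTAC" and read 1 overlaps read 2 by "ACGG". Each lookup costs a binary search rather than a comparison against every other read, which is the property the suffix-array approaches exploit far more efficiently.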
Simulated reads from Human chr22: read length 50, 20X coverage, error rate 0.1%
  Assembler         Time      Memory
  Velvet            292 mins  ~17 GB
  Edena             37 mins   ~7 GB
  Pasqual           43 mins   ~8 GB
  Parallel Pasqual  9 mins    ~8 GB
  Machine: Xeon E GHz with 4 cores/8 threads
Mixed Length Reads
  H. influenzae
    ◦ Read lengths of 30 to 300
  Velvet does not work
    ◦ K is fixed
    ◦ With a large K, reads shorter than K cannot be assembled
    ◦ With a small K, it is difficult to assemble the long reads
  Overlap graph assemblers do not have this issue
    ◦ Newbler
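The fixed-K problem can be seen directly from K-mer extraction: a read of length L yields L - K + 1 K-mers, which is zero or fewer once L < K. The sketch below uses made-up read lengths within the 30 to 300 range; the function names are hypothetical.

```python
def kmers(read, k):
    """All K-mers of a read; the list is empty when the read is shorter than K."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def usable_fraction(read_lengths, k):
    """Fraction of reads long enough to contribute at least one K-mer."""
    usable = sum(1 for length in read_lengths if length >= k)
    return usable / len(read_lengths)


# Hypothetical mix of read lengths between 30 and 300.
lengths = [30, 50, 100, 300]
fraction_k31 = usable_fraction(lengths, k=31)
fraction_k99 = usable_fraction(lengths, k=99)
```

In this toy mix, K = 31 already discards the 30-base read, and K = 99 discards half of the reads. An overlap-based assembler compares reads directly, so it has no single K that every read must exceed.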
Conclusion
  Controversial
    ◦ The relation between De Bruijn graphs and overlap graphs is still unclear
  We can still conclude from the experiments
    ◦ Regarding quality and accuracy, overlap graph assemblers are thought to be better than De Bruijn graph assemblers
    ◦ De Bruijn graph assemblers do not work for long reads
    ◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
    ◦ Traditional overlap graph assemblers are slower and use more memory, but the latest ones (Edena, Pasqual) ran faster and used less memory than Velvet on the chr22 data
Quality score and length distribution (one plot per sample; table columns: Id, Mean length, Median length, Std dev)
Velvet
  Table columns: Id, K, No. of contigs, N50, Max length, Total length, % reads used (six samples, Ids beginning with M)
  $> velveth -fasta -long
  $> velvetg
  Input: Fasta/Fastq
  Output: Fasta
WGS assembler (Celera)
  Table columns: Id, No. of contigs, N50, Max length, Total length, % reads used (six samples, Ids beginning with M)
  $> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
  $> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
  Input: frg format
  Output: Fasta
  More than 50 separate programs make up the Celera Assembler pipeline; the runCA script helps manage them all
Newbler
  De novo assembly
    ◦ Table columns: Id, No. of contigs, N50, Max length, Total length (six samples, Ids beginning with M)
  Reference assembly (Haemophilus-influenzae-refseq.fasta)
    ◦ Table columns: Id, No. of contigs, N50, Max length, Total length (six samples, Ids beginning with M)
  $> runAssembly // de novo assembly
  Input: .sff
  Output: Fasta
MIRA
  MIRA stands for Mimicking Intelligent Read Assembly
  Table columns: Id, No. of contigs, N50, Max length, Total length, % reads used (six samples, Ids beginning with M)
  $> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
  $> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
  Input: Fasta + qual + trace info
  Output: Fasta, Ace
Eagle view - M19107.ace
Eagle view - M19501.ace
Lab Exercise
  Download the Lab Exercise file from the Genome Assembly wiki page