Published by Anthony Ward. Modified over 9 years ago.
1
Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye
2
Sequencing Methods
Experimental comparison of De Bruijn graph and Overlap graph assemblers
Preliminary Results
Lab Exercise
3
Sequencing Methods
Sanger Sequencing
◦ Cycle sequencing reaction
◦ ddNTP-terminated, dye-labeled products
◦ High-resolution electrophoretic separation
◦ Parallelized in 96 or 384 capillaries
◦ Read lengths up to 1 kbp
◦ Raw accuracy up to 99.999%
◦ Costs 50¢ per kilobase
4
Sequencing Methods
Second-Generation Sequencing
◦ Cyclical array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow
5
Illumina
6
Illumina continued
7
AB SOLiD
8
454 Pyrosequencing
Create a DNA library
◦ Ligate adaptors to fragments
Emulsion PCR
◦ Agarose beads
◦ Oil, water, PCR reagents
◦ Results in about 1 million copies of a fragment on each bead
9
More 454
Beads arrayed into a picotiter plate
◦ Immobilized via addition of enzyme-containing beads
◦ Each well contains exactly 1 bead
Enzymes used: Bst polymerase, luciferase, apyrase, ATP sulfurylase
10
Even more 454
Example of Output
◦ Flow order: TACGTACG
◦ Signal levels: 1-mer, 2-mer, 3-mer, 4-mer
◦ Key: TCAG
Measures the presence or absence of each nucleotide at any given position
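The output described on this slide can be read back into bases with a few lines of code. The following is a minimal sketch, assuming the flowgram is a list of normalized signal values (about 1.0 per incorporated base) and that rounding to the nearest integer is an acceptable homopolymer call. The function name `decode_flowgram` and the example signal values are illustrative and do not come from the 454 software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Each flow adds one nucleotide type; a signal near n means a
    homopolymer run of n bases, and a signal near 0 means absence.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round non-negative signal to nearest int
        bases.append(nucleotide * run_length)
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical signals for the flows T A C G T A C G.
    key_flows = [1.02, 0.05, 0.97, 0.10, 0.03, 1.10, 0.00, 0.95]
    print(decode_flowgram(key_flows))
```

With these hypothetical signals the eight flows spell out the key sequence TCAG, which is how a known key at the start of each read can be used to calibrate the 1-mer signal level.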
11
Videos (454 Workflow)
12
Videos (Pyrosequencing) note: we did not choose the music
13
Comparison of 2nd Gen Platforms
14
Sequencing Methods
Experimental comparison of De Bruijn graph and Overlap graph assemblers
Preliminary Results
Lab Exercise
15
De Bruijn Graph assemblers and Overlap Graph assemblers
De Bruijn Graph assemblers
◦ Velvet, ABySS, EULER
Overlap Graph assemblers
◦ Newbler, Edena, SSAKE, VCAKE
16
Synthetic Data used for Experiments
Write a C program to simulate reads from a reference genome with a specified read length, coverage, and base error rate
◦ Human chr 22, ~33.5M bases
◦ Streptococcus suis, NC_012925.1, ~2M bases
◦ Helicobacter acinonychis Sheeba, ~1.5M bases
Write another C program to measure the quality of assemblers
◦ N50 length
◦ No. of contigs
◦ Max contig length
◦ No. of misassembled contigs
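The two programs described on this slide were written in C and are not shown. As a rough illustration of what they compute, here is a minimal Python sketch. It assumes uniformly placed forward-strand reads and substitution-only errors, which may differ from the original simulator. The names `simulate_reads` and `n50` are hypothetical.

```python
import random


def simulate_reads(genome, read_len, coverage, error_rate, seed=0):
    """Sample reads uniformly from genome and inject substitution errors.

    Assumptions (not from the slides): forward strand only, no indels.
    """
    rng = random.Random(seed)
    n_reads = int(coverage * len(genome) / read_len)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the other nucleotides.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_genome = "ACGT" * 250  # 1,000-base toy genome, for illustration only
    reads = simulate_reads(toy_genome, read_len=50, coverage=20, error_rate=0.001)
    print(len(reads), n50([80, 70, 50, 40, 30, 20]))
```

The number of contigs and the maximum contig length are simply `len(contig_lengths)` and `max(contig_lengths)`. Counting misassembled contigs requires aligning each contig back to the reference and is not sketched here.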
17
Read Length
De Bruijn graph assemblers are only suitable for short-read data
K limitation
◦ Use a hash table or sorting to index K-mers
◦ Need a unique key (value) to represent each K-mer
◦ K = 16: 4^16 = 2^32, fits a 32-bit integer (unsigned int)
◦ K = 32: 4^32 = 2^64, fits a 64-bit integer (unsigned long long)
◦ K > 32: multiple integers are needed to represent the hash table key
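The K limit follows from packing each base into 2 bits. The sketch below assumes the common A=0, C=1, G=2, T=3 encoding. The slides do not say which encoding Velvet or the other assemblers use, and the helper names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode_kmer(kmer):
    """Pack a K-mer into a single integer, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key


def words_needed(k, word_bits=64):
    """Number of fixed-width integers needed to hold a 2k-bit key."""
    return -(-2 * k // word_bits)  # ceiling division


if __name__ == "__main__":
    for k in (16, 31, 32, 99):
        largest = encode_kmer("T" * k)
        print(k, largest.bit_length(), words_needed(k))
```

Python integers grow without bound, so the overflow itself is not visible here. In C the key for K = 32 exactly fills an `unsigned long long`, and K = 99 would need a key spread over four 64-bit words.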
18
Simulate reads from Streptococcus suis: read length 300, 50X coverage, error rate 0.1%
Velvet default: K <= 31, so we use K = 31
Assembler | # of contigs (total length) | N50 length | # of misassembled contigs (total length)
Velvet | 46515 (1716053 bp) | 115 bp | 5 (1346 bp)
Recompile Velvet, K = 99
Assembler | # of contigs (total length) | N50 length | # of misassembled contigs (total length)
Velvet | 441 (1974382 bp) | 15328 bp | 1 (34 bp)
19
Quality and Accuracy
Some of the literature states that “De Bruijn based approach prone to false positives” and that “Overlap graph has better quality”
20
Simulate reads from Helicobacter acinonychis Sheeba: read length 35, 50X coverage, error rate 0.1%
Assembler | # of contigs (total length) | N50 length | # of misassembled contigs (total length)
Velvet | 336 (1525746 bp) | 10.4 kbp | 17 (156637 bp)
Edena | 340 (1513259 bp) | 9.8 kbp | 0 (0 bp)
21
Simulate reads from Streptococcus suis: read length 35, 50X coverage, error rate 0.1%
Assembler | # of contigs (total length) | N50 length | # of misassembled contigs (total length)
Velvet | 1106 (1969617 bp) | 5266 bp | 12 (255594 bp)
Edena | 1003 (1970342 bp) | 6416 bp | 0 (0 bp)
22
Runtime and Memory Usage
Overlap graph based assemblers are computationally expensive and use more memory
◦ All-to-all alignment of reads, O(n^2)
◦ More memory is needed to store the overlap graph
Typically, the number of reads is much larger than the number of K-mers
◦ Especially for short-read data
◦ With the same coverage and genome length, shorter reads mean more reads
It is stated that De Bruijn graphs are more suitable for NGS data
◦ Shorter reads and high throughput
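The scaling argument on this slide can be made concrete with a little arithmetic. The sketch below assumes a round 2,000,000-base genome and an error-free, single-strand bound on the number of distinct K-mers. Both are simplifications made for illustration, not figures from the experiments.

```python
def num_reads(genome_len, read_len, coverage):
    """Reads needed so that reads * read_len is about coverage * genome_len."""
    return coverage * genome_len // read_len


def all_pairs(n):
    """Unordered read pairs a naive all-to-all overlap step would compare."""
    return n * (n - 1) // 2


def max_genomic_kmers(genome_len, k):
    """Upper bound on distinct K-mers from one strand, ignoring read errors."""
    return max(genome_len - k + 1, 0)


if __name__ == "__main__":
    genome_len = 2_000_000  # assumed round figure for a ~2M-base genome
    for read_len in (300, 50, 35):
        n = num_reads(genome_len, read_len, coverage=50)
        print(read_len, n, all_pairs(n), max_genomic_kmers(genome_len, 31))
```

At fixed coverage the read count grows as reads get shorter and the pair count grows quadratically with it, while the K-mer bound depends only on genome length. Sequencing errors add extra K-mers in practice, so the bound understates the real De Bruijn graph size.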
23
Simulate reads from Streptococcus suis: 802995 reads, read length 50, 20X coverage, error rate 0.1%
Machine: Xeon E5530 2.4 GHz
Assembler | Time | Memory
Velvet | 33 secs | ~220 MB
SSAKE | 26 mins | ~900 MB
VCAKE | 107 mins | ~1.1 GB
24
However!
Recent advances in pattern matching algorithms and techniques enable the use of overlap graphs
◦ Suffix tree, suffix array, prefix array, compressed suffix array
Suffix array
◦ Able to find overlaps between reads in linear time
◦ Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
Examples
◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008.
◦ Jared T. Simpson and Richard Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010.
◦ Pasqual: Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
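To show why an index removes the need for all-to-all comparison, here is a toy sketch in the spirit of a suffix array: every proper suffix of every read is stored in one sorted list, and each read prefix is looked up by binary search. This is a naive construction that stores the suffix strings explicitly, so it is neither linear-time nor memory-efficient. It is not the method used by Edena, the FM-index string graph construction, or Pasqual, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_index(reads, min_overlap):
    """Sorted list of (suffix, read_id) for proper suffixes >= min_overlap."""
    entries = []
    for read_id, read in enumerate(reads):
        for start in range(1, len(read) - min_overlap + 1):
            entries.append((read[start:], read_id))
    entries.sort()
    return entries


def find_overlaps(reads, min_overlap):
    """Return (a, b, length) where a suffix of read a equals a prefix of read b."""
    index = build_suffix_index(reads, min_overlap)
    overlaps = []
    for b, read in enumerate(reads):
        for length in range(min_overlap, len(read)):
            prefix = read[:length]
            # (prefix, -1) sorts before every real entry with the same string.
            i = bisect_left(index, (prefix, -1))
            while i < len(index) and index[i][0] == prefix:
                a = index[i][1]
                if a != b:
                    overlaps.append((a, b, length))
                i += 1
    return sorted(overlaps)


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # hypothetical reads
    print(find_overlaps(toy_reads, min_overlap=3))
```

Each lookup costs a binary search instead of a scan over all other reads. Real implementations replace the explicit strings with integer offsets or a compressed index to reach the time and memory bounds mentioned on the slide.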
25
Simulate reads from Human chr22: 6978908 reads, read length 50, 20X coverage, error rate 0.1%
Machine: Xeon E5530 2.4 GHz with 4 cores/8 threads
Assembler | Time | Memory
Velvet | 292 mins | ~17 GB
Edena | 37 mins | ~7 GB
Pasqual | 43 mins | ~8 GB
Parallel Pasqual | 9 mins | ~8 GB
26
Mixed Length Reads
H. influenzae
◦ Read lengths of 30 ~ 300
Velvet does not work
◦ K is fixed
◦ If we use a big K, the reads shorter than K cannot be assembled
◦ If we use a small K, it is difficult to assemble the long reads
Overlap graph assemblers do not have this issue
◦ Newbler
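The fixed-K problem is easy to see by counting how many K-mers each read contributes. This is a minimal sketch with hypothetical helper names. The read lengths mirror the 30 ~ 300 range on the slide, and the reads themselves are placeholders.

```python
def kmers(read, k):
    """All overlapping K-mers of a read; empty when the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def usable_fraction(read_lengths, k):
    """Fraction of reads long enough to contribute at least one K-mer."""
    if not read_lengths:
        return 0.0
    usable = sum(1 for length in read_lengths if length >= k)
    return usable / len(read_lengths)


if __name__ == "__main__":
    short_read = "A" * 30   # placeholder sequence
    long_read = "A" * 300   # placeholder sequence
    for k in (21, 31, 99):
        print(k, len(kmers(short_read, k)), len(kmers(long_read, k)))
```

A large K drops every read shorter than K from the graph, while a small K cuts each long read into many short K-mers and discards most of the long-range information it carried.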
27
Conclusion
Controversial
◦ The relation between De Bruijn graphs and overlap graphs is still unclear
We can still conclude from the experiments
◦ Regarding quality and accuracy, overlap graph assemblers are thought to be better than De Bruijn graph assemblers
◦ De Bruijn graph assemblers do not work for long reads
◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
◦ Traditional overlap graph assemblers are slower and use more memory, but the latest overlap graph assemblers are better than De Bruijn graph assemblers
28
Sequencing Methods
Experimental comparison of De Bruijn graph and Overlap graph assemblers
Preliminary Results
Lab Exercise
29
Quality score and length distribution
Id | Mean length | Median length | Std dev
M19107 | 577.5849 | 569 | 83.9605
30
Quality score and length distribution
Id | Mean length | Median length | Std dev
M19501 | 624.7172 | 621 | 78.4074
31
Quality score and length distribution
Id | Mean length | Median length | Std dev
M21127 | 618.7576 | 616 | 81.5678
32
Quality score and length distribution
Id | Mean length | Median length | Std dev
M21621 | 620.6305 | 621 | 83.978
33
Quality score and length distribution
Id | Mean length | Median length | Std dev
M21639 | 573.384 | 564 | 66.5525
34
Quality score and length distribution
Id | Mean length | Median length | Std dev
M21709 | 626.2459 | 624 | 78.2447
35
Velvet
Columns: Id, K, No. of contigs, N50, Max length, Total length, % reads used
M19107 19 21716016665290554397.3535
M19107 29 17674126655331503388.7319
M19501 19 61803613429471628678.9177
M19501 29 53707718490572553035.5981
M21127 19 31999915483349861391.4239
M21127 29 25994224416399841873.0187
M21621 19 21887216640305252293.7490
M21621 29 15785326838325683787.5425
M21639 19 77086713628581886885.0236
M21639 29 68033919601734859946.1671
M21709 19 29115616768342563295.7695
M21709 29 20773625816363741983.8704
$> velveth -fasta -long
$> velvetg
Input: Fasta/Fastq
Output: Fasta
36
WGS assembler (Celera)
Columns: Id, No. of contigs, N50, Max length, Total length, % reads used
M19107 2361188132038176606096.3570
M19501 2141230451927811298.6032
M21127 345834926765194795597.9181
M21621 356779130668189263398.1710
M21639 3262092991261081398.3939
M21709 520439315002170004098.5221
$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
Input: frg format
Output: Fasta
More than 50 separate programs make up the Celera Assembler pipeline; the runCA script helps manage them all
37
Newbler
De Novo Assembly
Columns: Id, No. of contigs, N50, Max length, Total length
M19107 217156593800025112606
M19501 75157459343196106836011
M21127 5912125631627440693944
M21621 5013843733942450432798
M21639 17543023182797158028027
M21709 5214012831986969503256
Reference Assembly (Haemophilus-influenzae-refseq.fasta)
Columns: Id, No. of contigs, N50, Max length, Total length
M19107 12602496104091224223
M19501 9883503187241380153
M21127 - - - -
M21621 - - - -
M21639 12722701137121416318
M21709 31313836702981607841
$> runAssembly // de novo assembly
Input: .sff
Output: Fasta
38
MIRA
MIRA stands for Mimicking Intelligent Read Assembly
Columns: Id, No. of contigs, N50, Max length, Total length, % reads used
M19107 2081837951687179513495.7478
M19501 181185484321569190119897.7347
M21127 8981157305626195124097.4776
M21621 6790877253924188748497.5015
M21639 17590800152373237888898.1330
M21709 8362871197745184024897.6776
$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
Input: Fasta + qual + trace info
Output: Fasta, Ace
39
Eagle view - M19107.ace
40
Eagle view - M19501.ace
41
Works Cited
“Next-generation DNA sequencing”, Shendure and Ji, http://compgenomics2011.biology.gatech.edu/images/f/f9/Shendure-NatureBiotechnology-2008.pdf
“Next-generation DNA sequencing methods”, Mardis, http://compgenomics2011.biology.gatech.edu/images/5/59/Mardis-AnnuRevGenet-2008.pdf
42
Sequencing Methods
Experimental comparison of De Bruijn graph and Overlap graph assemblers
Preliminary Results
Lab Exercise
43
Lab Exercise
Download the Lab Exercise file from the Genome Assembly wiki page