Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

• Sequencing Methods
• Experimental comparison of De Bruijn graph and overlap graph assemblers
• Preliminary Results
• Lab Exercise

Sequencing Methods
• Sanger Sequencing
  ◦ Cycle sequencing reaction
  ◦ ddNTP-terminated, dye-labeled products
  ◦ High-resolution electrophoretic separation
  ◦ Parallelized in 96 or 384 capillaries
  ◦ Read lengths up to 1 kbp
  ◦ Raw accuracy up to 99.999%
  ◦ Costs 50 ¢ per kb

Sequencing Methods
• Second-Generation Sequencing
  ◦ Cyclic array methods
    ▪ 454
    ▪ Illumina
    ▪ AB SOLiD
    ▪ Polonator
    ▪ HeliScope
  ◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow

Illumina

Illumina continued

AB SOLiD

454 Pyrosequencing
• Create a DNA library
  ◦ Ligate adaptors to fragments
• Emulsion PCR
  ◦ Agarose beads
  ◦ Oil, water, PCR reagents
  ◦ Results in about 1 million copies of a single fragment on each bead

More 454
• Beads arrayed into a picotiter plate
  ◦ Immobilized via addition of enzyme-containing beads
  ◦ Each well contains exactly 1 bead
• Bst polymerase, luciferase, apyrase, and ATP sulfurylase are used

Even more 454
[Figure: example of output (flowgram). Flow order: TACGTACG; signal levels: 1-mer, 2-mer, 3-mer, 4-mer; key: TCAG]
• Measures the presence or absence of each nucleotide at any given position
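As a rough illustration of how a flowgram like the one on this slide maps to bases, here is a minimal Python sketch. It assumes the repeating flow order TACG and the TCAG key shown on the slide. The function name `call_bases`, the signal values, and the simple rounding rule are illustrative assumptions, not the actual 454 base caller.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Assumption: a signal near n means a homopolymer of n copies of the
    nucleotide flowed in that cycle, and a signal near 0 means the
    nucleotide was absent. Real base callers model noise far more carefully.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest non-negative homopolymer length.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signals for the key sequence, two cycles of T, A, C, G flows.
key_signals = [1.02, 0.05, 0.97, 0.10, 0.03, 1.10, 0.08, 0.95]
print(call_bases(key_signals))  # TCAG
```

The rounding step is also where homopolymer errors arise: a noisy signal of 3.5 cannot be cleanly assigned to a 3-mer or a 4-mer.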

Videos (454 Workflow)

Videos (Pyrosequencing) note: we did not choose the music

Comparison of 2nd Gen Platforms

De Bruijn Graph assemblers and Overlap Graph assemblers
• De Bruijn Graph assemblers
  ◦ Velvet, ABySS, EULER
• Overlap Graph assemblers
  ◦ Newbler, Edena, SSAKE, VCAKE

Synthetic Data used for Experiments
• Write a C program to simulate reads from a reference genome with a specific read length, coverage, and base error rate
  ◦ Human chr 22, ~33.5M bases
  ◦ Streptococcus suis, NC_012925.1, ~2M bases
  ◦ Helicobacter acinonychis Sheeba, ~1.5M bases
• Write another C program to measure the quality of assemblers
  ◦ N50 length
  ◦ No. of contigs
  ◦ Max contig length
  ◦ No. of misassembled contigs
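The two C programs themselves are not shown in the slides. The Python sketch below only illustrates the idea, under stated assumptions: reads start at uniformly random positions on the forward strand, errors are substitutions only, and N50 is taken as the contig length at which the running total of contigs sorted from longest to shortest first reaches half of the total assembly length. The names `simulate_reads` and `n50` are hypothetical.

```python
import random


def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample fixed-length reads from a genome string with substitution errors."""
    rng = random.Random(seed)
    # Number of reads needed so that reads * read_length ~= coverage * genome length.
    n_reads = len(genome) * coverage // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with a different nucleotide.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_genome = "ACGT" * 50  # stand-in for a real reference sequence
    reads = simulate_reads(toy_genome, read_length=20, coverage=5, error_rate=0.001)
    print(len(reads), "reads;", "N50 of [2, 3, 4, 5, 6] =", n50([2, 3, 4, 5, 6]))
```

The read count follows from genome length × coverage / read length, which is why shorter reads at the same coverage mean more reads.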

Read Length
• De Bruijn graph assemblers are only suitable for short-read data
• K limitation
  ◦ Use a hash table or sorting to index K-mers
    ▪ Need a unique key (value) to represent each K-mer
    ▪ K = 16: 4^16 = 2^32, 32-bit integer (unsigned int)
    ▪ K = 32: 4^32 = 2^64, 64-bit integer (unsigned long long)
    ▪ K > 32? Multiple integers are needed to represent the hash table key
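The key-size arithmetic on this slide can be checked with a small Python sketch. It packs each base into 2 bits (the mapping A=0, C=1, G=2, T=3 is an assumption, as are the names `encode_kmer` and `words_needed`). Python integers are unbounded, so `words_needed` only reports how many fixed-width machine words a C implementation would need for a given K.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode_kmer(kmer):
    """Pack a K-mer into a single integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def words_needed(k, word_bits=64):
    """Number of word_bits-wide integers needed to hold a 2-bit-packed K-mer."""
    return -(-2 * k // word_bits)  # ceiling division


print(encode_kmer("ACGT"))                  # 27, i.e. binary 00 01 10 11
print(encode_kmer("T" * 16) == 2**32 - 1)   # True: K = 16 fills 32 bits
print(words_needed(32), words_needed(99))   # 1 4
```

With K = 99, as used after recompiling Velvet on the next slide, a packed key takes 198 bits, which is four 64-bit words.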

• Simulate reads from Streptococcus suis
• Read length 300, 50X coverage, error rate 0.1%
• Velvet default: K <= 31, so we use 31
Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     46515 (1716053 bp)           115 bp      5 (1346 bp)
• Recompile Velvet, K = 99
Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     441 (1974382 bp)             15328 bp    1 (34 bp)

Quality and Accuracy
• Some of the literature states that “De Bruijn based approach prone to false positives” and “Overlap graph has better quality”

• Simulate reads from Helicobacter acinonychis Sheeba
• Read length 35, 50X coverage, error rate 0.1%
Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     336 (1525746 bp)             10.4 kbp    17 (156637 bp)
Edena      340 (1513259 bp)             9.8 kbp     0 (0 bp)

• Simulate reads from Streptococcus suis
• Read length 35, 50X coverage, error rate 0.1%
Assembler  # of contigs (total length)  N50 length  # of misassembled contigs (total length)
Velvet     1106 (1969617 bp)            5266 bp     12 (255594 bp)
Edena      1003 (1970342 bp)            6416 bp     0 (0 bp)

Runtime and Memory Usage
• Overlap graph based assemblers are computationally expensive and use more memory
  ◦ All-to-all alignment of reads, O(n^2)
  ◦ Use more memory to store the overlap graph
    ▪ Typically, the number of reads is much larger than the number of K-mers
  ◦ Especially for short-read data
    ▪ With the same coverage and genome length, shorter reads mean more reads
  ◦ It is stated that De Bruijn graphs are more suitable for NGS data
    ▪ Shorter reads and high throughput
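To make the O(n^2) cost concrete, the sketch below runs a naive all-to-all suffix-prefix comparison in Python. It uses exact matching only and is not how any of the assemblers named in these slides is implemented. The function names and the toy reads are assumptions.

```python
def longest_overlap(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b, or 0 if below min_overlap.

    Assumes min_overlap >= 1.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_pairs_overlaps(reads, min_overlap):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    edges = {}
    comparisons = 0
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            comparisons += 1
            length = longest_overlap(a, b, min_overlap)
            if length:
                edges[(i, j)] = length  # edge i -> j in the overlap graph
    return edges, comparisons


toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]
edges, comparisons = all_pairs_overlaps(toy_reads, min_overlap=3)
print(edges)        # {(0, 1): 3, (1, 2): 3}
print(comparisons)  # 6
```

For the roughly 800,000 reads in the next experiment, the same loop would make about 6.4 × 10^11 pairwise comparisons.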

• Simulate reads from Streptococcus suis
• 802995 reads
• Read length 50, 20X coverage, error rate 0.1%
• Xeon E5530 2.4 GHz
Assembler  Time      Memory
Velvet     33 secs   ~220 MB
SSAKE      26 mins   ~900 MB
VCAKE      107 mins  ~1.1 GB

However!
• Recent advances in pattern-matching algorithms and techniques enable the use of overlap graphs
  ◦ Suffix tree, suffix array, prefix array, compressed suffix array
• Suffix array
  ◦ Able to find overlaps between reads in linear time
  ◦ Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
• Examples
  ◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008.
  ◦ Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010.
  ◦ Pasqual
    ▪ Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
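The idea behind index-based overlap detection can be sketched in Python with a plain sorted list of read suffixes and binary search. This is a deliberately simplified stand-in: it is not a linear-time suffix array construction, not a compressed suffix array or FM-index, and not the method used by Edena, the string graph construction cited above, or Pasqual. All names are assumptions.

```python
from bisect import bisect_left


def build_suffix_index(reads, min_overlap):
    """Sorted (suffix, read_id) pairs for proper suffixes of length >= min_overlap."""
    entries = []
    for read_id, read in enumerate(reads):
        for start in range(1, len(read) - min_overlap + 1):
            entries.append((read[start:], read_id))
    entries.sort()
    return entries


def find_overlaps(reads, min_overlap):
    """Return (a, b, length) where a suffix of read a equals a prefix of read b."""
    index = build_suffix_index(reads, min_overlap)
    overlaps = []
    for b_id, b in enumerate(reads):
        for length in range(min_overlap, len(b)):
            prefix = b[:length]
            # (prefix, -1) sorts before every real entry with the same suffix.
            pos = bisect_left(index, (prefix, -1))
            while pos < len(index) and index[pos][0] == prefix:
                a_id = index[pos][1]
                if a_id != b_id:
                    overlaps.append((a_id, b_id, length))
                pos += 1
    return sorted(overlaps)


print(find_overlaps(["ACGTAC", "TACGGA", "GGATTT"], min_overlap=3))
# [(0, 1, 3), (1, 2, 3)]
```

Each prefix is looked up in the index instead of being compared against every other read, which is the property the real suffix-array and FM-index methods exploit far more efficiently.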

• Simulate reads from Human chr22
• 6978908 reads
• Read length 50, 20X coverage, error rate 0.1%
• Xeon E5530 2.4 GHz with 4 cores/8 threads
Assembler         Time      Memory
Velvet            292 mins  ~17 GB
Edena             37 mins   ~7 GB
Pasqual           43 mins   ~8 GB
Parallel Pasqual  9 mins    ~8 GB

Mixed Length Reads
• H. influenzae
  ◦ Read lengths of 30 ~ 300
• Velvet does not work
  ◦ K is fixed
  ◦ If we use a big K, reads shorter than K cannot be assembled
  ◦ If we use a small K, it is difficult to assemble the long reads
• Overlap graph assemblers do not have this issue
  ◦ Newbler
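The fixed-K problem can be seen directly from K-mer extraction. In the Python sketch below, a read shorter than K contributes no K-mers at all, so it cannot enter a De Bruijn graph built with that K. The function names and the toy read lengths (30, 100, 300) are illustrative assumptions.

```python
def kmers(read, k):
    """All overlapping K-mers of a read; empty when the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reads_without_kmers(reads, k):
    """Count reads that contribute nothing to the graph for this K."""
    return sum(1 for read in reads if len(read) < k)


mixed_reads = ["A" * 30, "C" * 100, "G" * 300]  # toy reads of mixed length
for k in (31, 99):
    counts = [len(kmers(read, k)) for read in mixed_reads]
    print(k, counts, reads_without_kmers(mixed_reads, k))
# 31 [0, 70, 270] 1
# 99 [0, 2, 202] 1
```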

Conclusion
• Controversial
  ◦ The relation between the De Bruijn graph and the overlap graph is still unclear
• We can still conclude from the experiments
  ◦ Regarding quality and accuracy, overlap graph assemblers are thought to be better than De Bruijn graph assemblers
  ◦ De Bruijn graph assemblers do not work for long reads
  ◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
  ◦ Traditional overlap graph assemblers are slower and use more memory, but the latest assemblers are better than De Bruijn graph assemblers

Quality score and length distribution
Id      Mean length  Median length  Std dev
M19107  577.5849     569            83.9605

Velvet
$> velveth -fasta -long
$> velvetg
Input: Fasta/Fastq
Output: Fasta
Columns: Id, K, No. of contigs, N50, Max length, Total length, % reads used (the values after K are run together in the source)
M19107  19  21716016665290554397.3535
        29  17674126655331503388.7319
M19501  19  61803613429471628678.9177
        29  53707718490572553035.5981
M21127  19  31999915483349861391.4239
        29  25994224416399841873.0187
M21621  19  21887216640305252293.7490
        29  15785326838325683787.5425
M21639  19  77086713628581886885.0236
        29  68033919601734859946.1671
M21709  19  29115616768342563295.7695
        29  20773625816363741983.8704

WGS assembler (Celera)
$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
Input: frg format
Output: Fasta
• More than 50 separate programs make up the Celera Assembler pipeline
• The runCA script helps manage them all
Columns: Id, No. of contigs, N50, Max length, Total length, % reads used (the values after Id are run together in the source)
M19107  2361188132038176606096.3570
M19501  2141230451927811298.6032
M21127  345834926765194795597.9181
M21621  356779130668189263398.1710
M21639  3262092991261081398.3939
M21709  520439315002170004098.5221

Newbler
$> runAssembly // de novo assembly
Input: .sff
Output: Fasta
De Novo Assembly
Columns: Id, No. of contigs, N50, Max length, Total length (the values after Id are run together in the source)
M19107  217156593800025112606
M19501  75157459343196106836011
M21127  5912125631627440693944
M21621  5013843733942450432798
M21639  17543023182797158028027
M21709  5214012831986969503256
Reference Assembly (Haemophilus-influenzae-refseq.fasta)
Columns: Id, No. of contigs, N50, Max length, Total length (the values after Id are run together in the source)
M19107  12602496104091224223
M19501  9883503187241380153
M21127  -  -  -  -
M21621  -  -  -  -
M21639  12722701137121416318
M21709  31313836702981607841

MIRA
• MIRA stands for Mimicking Intelligent Read Assembly
$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
Input: Fasta + qual + trace info
Output: Fasta, Ace
Columns: Id, No. of contigs, N50, Max length, Total length, % reads used (the values after Id are run together in the source)
M19107  2081837951687179513495.7478
M19501  181185484321569190119897.7347
M21127  8981157305626195124097.4776
M21621  6790877253924188748497.5015
M21639  17590800152373237888898.1330
M21709  8362871197745184024897.6776

EagleView - M19107.ace

EagleView - M19501.ace

Works Cited
• “Next-generation DNA sequencing”, Shendure and Ji, http://compgenomics2011.biology.gatech.edu/images/f/f9/Shendure-NatureBiotechnology-2008.pdf
• “Next-generation DNA sequencing methods”, Mardis, http://compgenomics2011.biology.gatech.edu/images/5/59/Mardis-AnnuRevGenet-2008.pdf

Lab Exercise
• Download the Lab Exercise file from the Genome Assembly wiki page
