Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

• Sequencing Methods
• Experimental comparison of De Bruijn graph and Overlap graph assemblers
• Preliminary Results
• Lab Exercise

Sequencing Methods
• Sanger Sequencing
  ◦ Cycle sequencing reaction
  ◦ ddNTP-terminated, dye-labeled products
  ◦ High-resolution electrophoretic separation
  ◦ Parallelized in 96 or 384 capillaries
  ◦ Read lengths up to 1 kbp
  ◦ Raw accuracy up to %
  ◦ Costs 50 ¢ per kb

Sequencing Methods
• Second Gen. Sequencing
  ◦ Cyclical array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
  ◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow

Illumina

Illumina continued

AB SOLiD

454 Pyrosequencing
• Create a DNA library
  ◦ Ligate adaptors to fragments
• Emulsion PCR
  ◦ Agarose beads
  ◦ Oil, water, PCR reagents
  ◦ Results in about 1 million copies of a fragment on each bead

More 454
• Beads arrayed into picotiter plate
  ◦ Immobilized via addition of enzyme-containing beads
  ◦ Each well contains exactly 1 bead
• Bst polymerase, luciferase, apyrase, ATP sulfurylase used

Even more 454
Example of output (flowgram): flow order TACGTACG; signal levels labelled 1-mer, 2-mer, 3-mer, 4-mer; key (TCAG)
• Measures the presence or absence of each nucleotide at any given position
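The flowgram idea can be sketched in a few lines of Python: each flow's light signal is rounded to the nearest whole number, and that number is read as the homopolymer length for the nucleotide flowed at that step. This is only an illustration under stated assumptions: the cyclic TACG flow order and the TCAG key come from the slide, while the signal values and the function name `call_bases` are hypothetical and not taken from the 454 software.

```python
def call_bases(signals, flow_order="TACG"):
    """Convert per-flow signals into a base sequence.

    Each signal is rounded to the nearest whole number, which is taken
    as the homopolymer length for the nucleotide flowed at that step.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round half up; signals are non-negative
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical signals for twelve flows (T A C G, three cycles).
# The first eight flows spell the key TCAG; the last four hold homopolymers.
example_signals = [1.02, 0.05, 0.97, 0.04,
                   0.03, 1.01, 0.08, 0.96,
                   2.04, 0.10, 2.93, 1.05]
print(call_bases(example_signals))  # TCAGTTCCCG
```

A signal near 0 means the flowed nucleotide was not incorporated, which is how the platform reports presence or absence at each position.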

Videos (454 Workflow)

Videos (Pyrosequencing) note: we did not choose the music

Comparison of 2nd Gen Platforms

De Bruijn Graph assemblers and Overlap Graph assemblers
• De Bruijn Graph assemblers
  ◦ Velvet, ABySS, Euler
• Overlap Graph assemblers
  ◦ Newbler, Edena, SSAKE, VCAKE
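To make the difference between the two graph types concrete, the following Python sketch builds both from the same three toy reads. It is a teaching illustration, not how Velvet or Edena are implemented: the reads, the function names, and the choices k = 4 and a minimum overlap of 3 are all hypothetical. In the de Bruijn graph the nodes are (k-1)-mers; in the overlap graph the nodes are whole reads.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def longest_overlap(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_edges(reads, min_overlap):
    """Nodes are whole reads; every ordered pair of reads is compared."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = longest_overlap(a, b, min_overlap)
                if length:
                    edges[(i, j)] = length
    return edges


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(de_bruijn_edges(reads, k=4))
print(overlap_edges(reads, min_overlap=3))  # {(0, 1): 4, (1, 2): 4}
```

Note that the overlap graph needs a comparison for every ordered pair of reads, while the de Bruijn graph only needs one pass over each read.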

Synthetic Data used for Experiments
• Write a C program to simulate reads from a reference genome with a specific read length, coverage and base error rate
  ◦ Human chr 22, ~33.5M bases
  ◦ Streptococcus suis, NC_ , ~2M bases
  ◦ Helicobacter acinonychis Sheeba, ~1.5M bases
• Write another C program to measure the quality of assemblers
  ◦ N50 length
  ◦ No. of contigs
  ◦ Max contig length
  ◦ No. of mis-assembled contigs
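The slides describe C programs that are not shown. As a rough stand-in, here is a minimal Python sketch of the two ideas: sampling reads at a given length, coverage and substitution error rate, and computing N50 from contig lengths. The function names, the uniform sampling, the substitution-only error model and the toy genome are assumptions made for illustration, not the authors' code.

```python
import random


def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample reads uniformly from genome and add substitution errors."""
    rng = random.Random(seed)
    n_reads = coverage * len(genome) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def n50(contig_lengths):
    """Largest length L such that contigs of length >= L cover at
    least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


toy_genome = "ACGT" * 250  # hypothetical 1,000-base genome
reads = simulate_reads(toy_genome, read_length=50, coverage=20, error_rate=0.001)
print(len(reads))               # 400
print(n50([10, 8, 6, 4, 2]))    # 8
```

The number of reads is derived from coverage × genome length / read length, so halving the read length at fixed coverage doubles the read count.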

Read Length
• De Bruijn graph assemblers are only suitable for short-read data
• K limitation
  ◦ Use a hash table or sorting to index K-mers
  ◦ Each K-mer needs a unique key (value) to represent it
  ◦ Small K: the key fits in one integer (unsigned int)
  ◦ Larger K, up to 32: the key fits in one integer (unsigned long long)
  ◦ K > 32: multiple integers are needed to represent the hash table key
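One common way to obtain such integer keys is to pack each base into two bits. The slide does not state the encoding, so the mapping A=0, C=1, G=2, T=3 and the function names below are assumptions. Under this encoding a 32-bit integer holds at most 16 bases and a 64-bit integer at most 32, which agrees with the slide's remark that K > 32 needs more than one integer.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer using two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode_kmer(value, k):
    """Recover the k-mer string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def max_k_for_bits(bits):
    """Largest K whose two-bit encoding fits in an integer of this width."""
    return bits // 2


print(encode_kmer("ACGT"))                      # 27
print(decode_kmer(27, 4))                       # ACGT
print(max_k_for_bits(32), max_k_for_bits(64))   # 16 32
```

Python integers are unbounded, so the width limits are only reported here; in C the same shifts would be applied to an `unsigned int` or `unsigned long long`.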

• Simulate reads from Streptococcus suis
• Read length 300, 50X coverage, error rate 0.1%
• Velvet default: K <= 31, so we use 31
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      46515 ( bp)                   115 bp       5 (1346 bp)
• Recompile Velvet, K = 99
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      441 ( bp)                     15328 bp     1 (34 bp)

Quality and Accuracy
• Some of the literature states: “De Bruijn based approach prone to false positives”, “Overlap graph has better quality”

• Simulate reads from Helicobacter acinonychis Sheeba
• Read length 35, 50X coverage, error rate 0.1%
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      336 ( bp)                     10.4 kbp     17 ( bp)
Edena       340 ( bp)                     9.8 kbp      0 (0 bp)

• Simulate reads from Streptococcus suis
• Read length 35, 50X coverage, error rate 0.1%
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      1106 ( bp)                    5266 bp      12 ( bp)
Edena       1003 ( bp)                    6416 bp      0 (0 bp)

Runtime and Memory Usage
• Overlap graph based assemblers are computationally expensive and use more memory
  ◦ All-to-all alignment of reads, O(n^2)
  ◦ More memory is needed to store the overlap graph
• Typically, the number of reads is much larger than the number of distinct K-mers
  ◦ Especially for short-read data: with the same coverage and genome length, shorter reads mean more reads
  ◦ De Bruijn graphs are therefore said to be more suitable for NGS data, which has shorter reads and high throughput
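The scaling argument can be checked with simple arithmetic: the number of reads is coverage × genome length / read length, and a naive all-to-all comparison touches every unordered pair of reads. The Python sketch below uses an assumed round genome size of 2,000,000 bases and hypothetical function names; it is a back-of-the-envelope calculation, not a measurement.

```python
def read_count(genome_length, coverage, read_length):
    """Number of reads needed to reach the given coverage."""
    return coverage * genome_length // read_length


def pairwise_comparisons(n_reads):
    """Unordered read pairs examined by a naive all-to-all alignment."""
    return n_reads * (n_reads - 1) // 2


genome_length = 2_000_000  # assumed round figure for a small bacterial genome
for read_length in (35, 50, 300):
    n = read_count(genome_length, 50, read_length)
    print(read_length, n, pairwise_comparisons(n))
```

For error-free reads from one strand, the number of distinct K-mers cannot exceed roughly the genome length, however many reads are sampled, whereas the read count grows as reads get shorter.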

• Simulate reads from Streptococcus suis
• reads
• Read length 50, 20X coverage, error rate 0.1%
• Xeon E GHz
Assembler   Time        Memory
Velvet      33 secs     ~220 M
SSAKE       26 mins     ~900 M
VCAKE       107 mins    ~1.1 G

However!
• Recent advances in pattern-matching algorithms and techniques enable the use of overlap graphs
  ◦ Suffix tree, suffix array, prefix array, compressed suffix array
• Suffix array
  ◦ Can find overlaps between reads in linear time
  ◦ A compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
• Examples
  ◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research. 18: ,
  ◦ Jared T. Simpson and Richard Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics (2010) 26 (12): i367-i373.
  ◦ Pasqual
• Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
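The gain from indexing can be illustrated without a full suffix array. The Python sketch below sorts the reads once and then uses binary search to find every read that starts with a given suffix, so no all-to-all comparison is needed. It is a deliberately simplified stand-in for the suffix-array and FM-index methods cited above, not the algorithm used by Edena, SGA or Pasqual; the function name and toy reads are hypothetical, and `min_overlap` is assumed to be at least 1.

```python
from bisect import bisect_left


def overlaps_via_sorted_reads(reads, min_overlap):
    """Find suffix-prefix overlaps by binary search over sorted reads.

    Returns {(i, j): length} where the last `length` bases of reads[i]
    equal the first `length` bases of reads[j]. Only the longest
    overlap per pair is kept; whole-read containment is ignored.
    """
    order = sorted(range(len(reads)), key=lambda idx: reads[idx])
    sorted_reads = [reads[idx] for idx in order]
    found = {}
    for i, read in enumerate(reads):
        # Try long suffixes first so the longest overlap is recorded.
        for length in range(len(read) - 1, min_overlap - 1, -1):
            suffix = read[-length:]
            pos = bisect_left(sorted_reads, suffix)
            # Reads that start with `suffix` are contiguous from pos.
            while pos < len(sorted_reads) and sorted_reads[pos].startswith(suffix):
                j = order[pos]
                if j != i:
                    found.setdefault((i, j), length)
                pos += 1
    return found


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(overlaps_via_sorted_reads(reads, min_overlap=3))  # {(0, 1): 4, (1, 2): 4}
```

Each suffix lookup costs a binary search over the sorted reads rather than a scan over all reads, which is the basic reason index-based overlap assemblers can avoid the O(n^2) step.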

• Simulate reads from Human chr22
• reads
• Read length 50, 20X coverage, error rate 0.1%
• Xeon E GHz with 4 cores/8 threads
Assembler          Time        Memory
Velvet             292 mins    ~17 GB
Edena              37 mins     ~7 GB
Pasqual            43 mins     ~8 GB
Parallel Pasqual   9 mins      ~8 GB

Mixed Length Reads
• H. influenzae
  ◦ Read lengths of 30 to 300
• Velvet does not work
  ◦ K is fixed
  ◦ With a big K, reads shorter than K cannot be assembled
  ◦ With a small K, it is difficult to assemble the long reads
• Overlap graph assemblers do not have this issue
  ◦ Newbler
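The fixed-K problem is easy to quantify: a read of length L contributes L - K + 1 K-mers, and none at all when L < K. The short Python sketch below applies this to a hypothetical set of read lengths within the 30 to 300 range; the lengths, the K values and the function names are made up for illustration.

```python
def kmers_per_read(read_length, k):
    """A read of length L contributes max(L - k + 1, 0) k-mers."""
    return max(read_length - k + 1, 0)


def usable_fraction(read_lengths, k):
    """Fraction of reads long enough to contribute at least one k-mer."""
    usable = sum(1 for length in read_lengths if length >= k)
    return usable / len(read_lengths)


lengths = [30, 45, 80, 150, 300]  # hypothetical lengths in the 30-300 range
for k in (21, 31, 99):
    print(k, usable_fraction(lengths, k), [kmers_per_read(n, k) for n in lengths])
```

With K = 99 only the two longest reads in this toy set contribute anything, while with K = 21 every read is used but the long reads are cut into short K-mers.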

Conclusion
• Controversial
  ◦ The relation between the De Bruijn graph and the Overlap graph is still unclear
• We can still conclude from the experiments
  ◦ Regarding quality and accuracy, Overlap graph assemblers are thought to be better than De Bruijn graph assemblers
  ◦ De Bruijn graph assemblers do not work for long reads
  ◦ De Bruijn graph assemblers do not work for mixed length reads (K is fixed)
  ◦ Traditional overlap graph assemblers are slower and use more memory, but the latest assemblers are better than De Bruijn graph assemblers

Quality score and length distribution (six slides, one per sample; table columns: Mean length, Median length, Std dev)

Velvet
Id   K   No. of contigs   N50   Max length   Total length   % reads used
(six rows, one per sample; Id values truncated to "M")
$> velveth -fasta -long
$> velvetg
Input: Fasta/Fastq
Output: Fasta

WGS assembler (Celera)
Id   No. of contigs   N50   Max length   Total length   % reads used
(six rows, one per sample; Id values truncated to "M")
$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
Input: frg format
Output: Fasta
• More than 50 separate programs make up the Celera Assembler pipeline
• The runCA script helps manage them all

Newbler
De Novo Assembly
Id   No. of contigs   N50   Max length   Total length
(six rows, one per sample; Id values truncated to "M")
Reference Assembly (Haemophilus-influenzae-refseq.fasta)
Id   No. of contigs   N50   Max length   Total length
(six rows, one per sample; Id values truncated to "M")
$> runAssembly // de novo assembly
Input: .sff
Output: Fasta

MIRA
• MIRA stands for Mimicking Intelligent Read Assembly
Id   No. of contigs   N50   Max length   Total length   % reads used
(six rows, one per sample; Id values truncated to "M")
$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
Input: Fasta + qual + trace info
Output: Fasta, Ace

Eagle view - M19107.ace

Eagle view - M19501.ace

Works Cited
• “Next-generation DNA sequencing”, Shendure et al., Nature Biotechnology, 2008
• “Next-generation DNA sequencing methods”, Mardis, Annu. Rev. Genomics Hum. Genet., 2008

Lab Exercise
• Download the Lab Exercise file from the Genome Assembly wiki page