Assembly algorithms for next-generation sequencing data


Assembly algorithms for next-generation sequencing data Jason R. Miller, Sergey Koren, Granger Sutton

OUTLINE Introduction. The Challenges of Assembly. Graph Algorithms for Assembly: Greedy Graph-Based Assemblers; Overlap/Layout/Consensus Assemblers; The de Bruijn Graph Approach. Future Lines of Questioning.

(A) Sampling from habitat. (B) Filtering particles, typically by size. (C) DNA extraction and lysis. (D) Cloning and library. (E) Sequence the clones into reads. (F) Sequence assembly. Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLOS Computational Biology 6(2): e1000667. https://doi.org/10.1371/journal.pcbi.1000667 http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000667

NEXT-GENERATION SEQUENCING The second-generation machines are characterized by: highly parallel operation; higher yield; simpler operation; much lower cost per read; shorter reads (unfortunately). Today's machines are commonly referred to as short-read sequencers or next-generation sequencers (NGS).

WHAT IS AN ASSEMBLY? An assembly is a data structure that maps the sequence data to a putative reconstruction of the target.

DE NOVO ASSEMBLY De novo assembly refers to reconstruction from scratch, without the aid of external data.

COVERAGE Coverage of a genome is defined as the mean number of times each nucleotide is sequenced. Thus, 5X coverage means that each nucleotide in the genome is sequenced five times on average.
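
The definition above can be sketched as a small calculation: mean coverage is the total number of sequenced bases divided by the genome length. The function name and the read and genome sizes below are hypothetical, chosen only for illustration.

```python
def mean_coverage(read_lengths, genome_length):
    """Mean coverage = total sequenced bases / genome length."""
    return sum(read_lengths) / genome_length

# Hypothetical example: 50,000 reads of 100 bases over a 1,000,000-base genome.
coverage = mean_coverage([100] * 50_000, 1_000_000)
print(f"{coverage:.1f}X")
```

Because this is a mean, individual positions can sit well above or below it, which is the non-uniform coverage problem listed among the challenges of assembly.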

THE CHALLENGES OF ASSEMBLY Repeat sequences: genomic regions that share perfect repeats can be indistinguishable. Sequencing error: can induce spurious assemblies. Non-uniform coverage: very low coverage induces gaps in assemblies, and coverage variability undermines the coverage-based diagnostics and statistical tests designed to detect errors and repeats.

THE CHALLENGES OF ASSEMBLY Computational complexity: larger volumes of data are more expensive to process. Genomic diversity and variable abundance within populations: assembly reconstructs the most abundant sequences, and coverage is usually incomplete. There is also the danger of assembling sequences from different species together, creating interspecies chimeras.

GREEDY GRAPH-BASED ASSEMBLERS The greedy algorithms apply one basic operation: given any read or contig, find the read or contig with the largest overlap and merge the two into a new contig. The basic operation is repeated until no more merges are possible.
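
The basic operation above can be sketched as follows. This is a minimal illustration, not the method of any particular assembler: it assumes error-free reads, exact suffix-to-prefix overlaps, a single strand, and a hypothetical minimum overlap parameter `min_len`. The function names are made up for this example.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no remaining overlap of at least min_len
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

contigs = greedy_assemble(["ACGCA", "CGCAT", "CATTC", "ATTCG", "TCGCG"])
print(contigs)
```

The all-pairs search is quadratic per merge, which is fine for a toy input but not for real data. Greedy merging can also pick a wrong overlap inside a repeat, which is one reason graph-based formulations are used.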

OVERLAP/LAYOUT/CONSENSUS ASSEMBLERS The OLC approach has three phases: Overlap - identify all pairs of reads that overlap and build an overlap graph. Layout - simplify the overlap graph into an approximate read layout (contigs). Consensus - determine the consensus sequence.

OVERLAP GRAPH Nodes represent the reads. Edges represent overlaps. Reads: ACGCA CGCAT CATTC ATTCG TCGCG. Finding the correct assembly is cast as a Hamiltonian path problem: finding a path in the graph that visits each vertex exactly once.
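
A minimal sketch of the overlap graph for the five reads on this slide. The minimum overlap of 3 is an assumption made for this example, and the brute-force search over permutations is only workable for a handful of reads (the Hamiltonian path problem is NP-complete in general). The function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Directed edges (a, b) -> overlap length, for overlaps >= min_len."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph

def hamiltonian_path(reads, graph):
    """Brute force: first ordering of the reads that follows graph edges."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None

reads = ["ACGCA", "CGCAT", "CATTC", "ATTCG", "TCGCG"]
graph = overlap_graph(reads)
path = hamiltonian_path(reads, graph)
print(graph)
print(path)
```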

K-MER A “K-mer” is a substring of length K, where K is any positive integer. Example read R: GGCGATTCATCG. All 3-mers of R: GGC GCG CGA GAT ATT TTC TCA CAT ATC TCG
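
The enumeration above can be reproduced with a short sketch; the function name `kmers` is chosen for this example.

```python
def kmers(read, k):
    """All substrings of length k, in order of their position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

three_mers = kmers("GGCGATTCATCG", 3)
print(" ".join(three_mers))
```

A read of length L yields L - K + 1 K-mers, so the 12-base read here gives ten 3-mers.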

THE DE BRUIJN GRAPH The de Bruijn graph was developed outside the realm of DNA sequencing to represent strings from a finite alphabet. The nodes represent all possible fixed-length strings. The edges represent suffix-to-prefix perfect overlaps. A K-mer graph is a form of de Bruijn graph. Its nodes represent all the fixed-length subsequences (k-mers) drawn from a read. Its edges represent all the fixed-length overlaps between subsequences.

THE DE BRUIJN GRAPH Reads: ATGCTA CTATGC ATGCGT. k-mers (k=4): ATGC TGCT GCTA CTAT TATG ATGC TGCG GCGT. [Figure: K-mer graph of the reads, with an Eulerian path.]
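
A minimal sketch of the K-mer graph for the three reads on this slide, followed by an Eulerian path found with Hierholzer's algorithm. It follows the description given earlier (nodes are the distinct k-mers, edges are the overlaps of length k-1 between them) and makes several simplifying assumptions: error-free reads, a single strand, no use of k-mer multiplicity, and no check that an Eulerian path actually exists. All function names are hypothetical.

```python
def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_graph(reads, k):
    """Nodes: distinct k-mers. Edge a -> b when a's (k-1)-suffix is b's (k-1)-prefix."""
    nodes = sorted({km for r in reads for km in kmers(r, k)})
    return {a: [b for b in nodes if a[1:] == b[:-1]] for a in nodes}

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = {n: 0 for n in graph}
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    starts = [n for n in graph if len(graph[n]) - in_deg[n] == 1]
    start = starts[0] if starts else next(iter(graph))
    remaining = {n: list(ts) for n, ts in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Sequence spelled by a walk over k-mers that overlap by k-1."""
    return path[0] + "".join(km[-1] for km in path[1:])

graph = kmer_graph(["ATGCTA", "CTATGC", "ATGCGT"], 4)
path = eulerian_path(graph)
sequence = spell(path)
print(" ".join(path))
print(sequence)
```

The path visits ATGC twice because that node has two outgoing edges, which matches the order in which the k-mers are listed on the slide.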

DE NOVO ASSEMBLY SOFTWARE Greedy Assemblers: SSAKE, SHARCGS, VCAKE … OLC Assemblers: Newbler, CABOG, Edena, Shorty… DBG Assemblers: Euler, Velvet, AllPaths, ABySS, SOAP … Other software: PCAP, LOCAS, MIRA, Taipan, CLC Workbench, SeqMan …

EULER ASSEMBLER – ERROR CORRECTION The EULER assembler was the first to present this technique using de Bruijn graphs. Euler applies a filter to the reads before it builds its graph to identify sequencing error by comparing K-mer content between individual reads and all reads. It distrusts individual-read K-mers whose frequency in all reads is below a threshold. Euler corrects substitution errors. Finally, it either accepts a fully corrected read or rejects the read.  
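
The filtering idea above can be sketched as a k-mer count profile: count every k-mer across all reads, then flag the k-mers of one read whose count falls below a threshold. This is only an illustration of the detection step, not EULER's actual implementation, and it does not attempt the correction. The read set, `k`, the threshold, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def count_profile(read, counts, k):
    """Count of each k-mer of the read, in read order."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]

def weak_positions(read, counts, k, threshold):
    """Start positions of k-mers whose count in all reads is below threshold."""
    profile = count_profile(read, counts, k)
    return [i for i, c in enumerate(profile) if c < threshold]

# Hypothetical data: four copies of one read plus a single variant,
# which is assumed here to carry a substitution error.
reads = ["CAGGTCT"] * 4 + ["CAGCTCT"]
counts = kmer_counts(reads, 3)
print(count_profile("CAGCTCT", counts, 3))
print(weak_positions("CAGCTCT", counts, 3, threshold=2))
```

A single substitution lowers the count of every k-mer that spans it, so the dip in the profile is up to k positions wide and shows where in the read the error lies.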

EULER ASSEMBLER Example of a sequencing error: the reads CAGGTCT and CAGCTCT differ by a single substitution. K-mers shown: CAG, AGG.

EULER ASSEMBLER For example, k-mer count profiles when errors are in different parts of the read GCGTATTACGCGTCTGGCCT. [Figure: k-mer count profiles.]

FUTURE LINES OF QUESTIONING Reads of the future will challenge assembly software in many ways: Almost certainly, data volume will continue to increase while manufacturing cost declines. The next-generation technology will surely be applied to larger genomes, more repetitive sequences, and less homogeneous samples. The quest for more powerful and efficient assembly software remains an area of critical research.

Thank you