DNA Sequencing with Longer Reads Byung G. Kim Computer Science Dept. Univ. of Mass. Lowell



Outline
- DNA distribution
  - Counting problem
- DNA sequencing
  - Reassemble DNA fragments
  - Part of $1,000 genome sequencing

Bioinformatics
- What is bioinformatics?
  - Intersection of biology, statistics, and computer science
- Bioinformatics problems
  - DNA sequences
  - Protein sequences/structures
  - Modeling/inference

DNA and RNA
- DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are linear chains of monomeric units called nucleotides
- A nucleotide has three parts: a sugar, a phosphate, and a base
- Four bases (monomers)

DNA Secondary Structure
- Double helix: Watson and Crick, 1953, using X-ray diffraction
- Sugar-phosphate backbone is the outer part of the helix
- Two strands run in antiparallel directions
- Two strands are complementary
- Base pairing: A-T; G-C

Monomer Counts in DNA
- In double strands
  - # of A = # of T; # of G = # of C
  - Erwin Chargaff's 1st Parity Rule, 1951
- In a single strand?
  - # of A ≈ # of T; # of G ≈ # of C (holds approximately)
  - Erwin Chargaff's 2nd Parity Rule
- How about oligomer (a few successive bases) frequencies?

Oligomer Frequencies
- Oligomer length = k
- Window of k bases sliding by one base (successive windows overlap by k-1 bases)
- A simple counting program
  - May have to contend with long sequences
- An oligomer and its reverse complement
  - ACT vs. AGT
- Example sequence: ACTAAGCG…
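The sliding-window count and the reverse-complement pairing can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's program; the names `count_kmers` and `reverse_complement` are hypothetical, and the input is assumed to contain only the letters A, C, G, and T.

```python
from collections import Counter

# Translation table for complementing bases (assumes an ACGT-only alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo):
    """Complement every base, then reverse the string."""
    return oligo.translate(COMPLEMENT)[::-1]


def count_kmers(seq, k):
    """Count every k-mer seen by a window of width k sliding one base at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# The example sequence from the slide has 8 bases, so it holds 8 - 3 + 1 = 6 trimers.
trimer_counts = count_kmers("ACTAAGCG", 3)
```

For chromosome-sized inputs the same loop can be fed from a file reader in chunks, carrying over the last k-1 bases between chunks so that no window is lost.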

Trimer Frequencies in Yeast

Trimer Frequencies in a Few Species

Observations
- Symmetric in oligomers and their reverse complements
- Very similar among chromosomes
Issues
- Two successive windows share (k-1) bases
  - Are they independent? Probably not (consider a large k)
  - Need a jumping window
- Given an oligomer distribution with k, generate a random sequence according to the same distribution
  - Compare the distributions with (k+1) from the random and the real sequences
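A jumping window differs from the sliding window only in its step size: it advances by k bases, so consecutive windows share nothing. The sketch below is an assumption about how that could be coded; `count_kmers_jumping` and its `offset` parameter are hypothetical names.

```python
from collections import Counter


def count_kmers_jumping(seq, k, offset=0):
    """Count k-mers in non-overlapping windows: the window jumps k bases each step."""
    return Counter(seq[i:i + k] for i in range(offset, len(seq) - k + 1, k))


# With k = 3 the 8-base example yields only two windows: ACT and AAG.
jump_counts = count_kmers_jumping("ACTAAGCG", 3)
```

Varying `offset` from 0 to k-1 gives k different reading frames, each of which is free of the shared-base dependence between neighbouring windows.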

DNA Sequencing Problem
- A sample (genome) is amplified and broken into multiple fragments
- Reconstruct the original sequence from its fragments (reads)

Shot-gun Sequencing

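Base coverage
- A sample (genome) is amplified
- A base in the sample is copied into many reads
- But reads are generated at random positions
  - The number of reads covering a base follows a Poisson distribution
- How many copies of reads are needed to cover the whole genome?
  - Called sequencing depth (c)
  - From Poisson, the probability that a base is not covered is P(X=0) = exp(-c)
  - Expected # of uncovered bases is G*exp(-c); requiring G*exp(-c) < 1 gives c > ln(G)
  - Human genome (G = 3 Gb): c > ln(3*10^9) ≈ 21.8, so a depth of about 22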
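The depth estimate follows directly from the Poisson non-coverage probability, and it can be checked numerically. The function names below are hypothetical; the only inputs are the genome size G and the depth c used on the slide.

```python
import math


def expected_uncovered_bases(genome_size, depth):
    """Expected number of bases hit by no read: G * P(X = 0) = G * exp(-c)."""
    return genome_size * math.exp(-depth)


def min_depth_for_full_coverage(genome_size):
    """Smallest c with G * exp(-c) < 1, i.e. c must exceed ln(G)."""
    return math.log(genome_size)


HUMAN_GENOME = 3e9  # 3 Gb, as on the slide
threshold = min_depth_for_full_coverage(HUMAN_GENOME)
```

At depth 22 the expected number of uncovered bases drops below one, while at depth 21 it is still above one, which is where the threshold of about 22 comes from.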

Sequencing By Hybridization (SBH)
- More recent sequencing machines produce short k-mers, called reads
- Given an unknown DNA sequence, a DNA array provides
  - All strings of length k that the sequence contains
  - No information about their positions
- Sequencing by Hybridization (SBH) Problem
  - Reconstruct a string from its k-mer composition

SBH
- Spectrum(s, k)
  - For a string s of length n, the k-mer composition: the multiset of the n-k+1 k-mers in s
  - k = 3, s = TATGGTGC
  - Spectrum(s, k) = {TAT, ATG, TGG, GGT, GTG, TGC}
- Sequencing by Hybridization (SBH) Problem
  - Reconstruct a string from its k-mer composition
  - Input: A set S, representing all k-mers from an unknown string s
  - Output: A string s such that Spectrum(s, k) = S
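The spectrum definition translates into a one-line function. The sketch below uses the hypothetical name `spectrum` and returns the k-mers in order of position, although SBH itself provides no positional information.

```python
def spectrum(s, k):
    """Return the n-k+1 k-mers of s (a multiset, listed here in order of occurrence)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


# The slide's example: k = 3, s = TATGGTGC.
example_spectrum = spectrum("TATGGTGC", 3)
```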

Shortest Superstring Problem
- Find a superstring of the reads, but the shortest one
- Shortest Superstring Problem
  - Given a set of strings, find a shortest string that contains all of them
  - Input: Strings s1, s2, …, sn
  - Output: A shortest string s that contains all strings s1, s2, …, sn
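Finding a true shortest superstring is computationally hard, so a common stand-in is a greedy heuristic that repeatedly merges the pair of strings with the largest overlap. The sketch below shows that heuristic, which is an assumption on my part and is not guaranteed to return a shortest answer; `overlap` and `greedy_superstring` are hypothetical names.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: merge the best-overlapping pair until one string remains."""
    reads = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    # Drop strings already contained inside another string.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""


inputs = ["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]
superstring = greedy_superstring(inputs)
```

The result always contains every input string; whether it is also the shortest possible depends on how ties between equal overlaps happen to be broken.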

SBH as a Hamiltonian Path Problem
- Two k-mers p and q overlap if overlap(p, q) = k-1 (the last k-1 bases of p equal the first k-1 bases of q)
- Hamiltonian Path Problem
  - Given Spectrum(s, k), create a vertex for every k-mer in Spectrum(s, k)
  - Connect every two vertices whose k-mers overlap, then find a path that visits every vertex exactly once
- Overlap-Layout-Consensus (OLC)
- NP-complete
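The Hamiltonian formulation can be illustrated with a small backtracking search over the overlap graph. This is a sketch for tiny inputs only, since the search is exponential in the worst case; `hamiltonian_reconstruct` is a hypothetical name, and all k-mers are assumed to have the same length.

```python
def hamiltonian_reconstruct(kmers):
    """Rebuild a string by finding a path that visits every k-mer vertex exactly once."""
    n = len(kmers)
    # Edge i -> j when the last k-1 bases of k-mer i equal the first k-1 bases of k-mer j.
    adj = {
        i: [j for j in range(n) if i != j and kmers[i][1:] == kmers[j][:-1]]
        for i in range(n)
    }

    def extend(path, used):
        if len(path) == n:
            return path
        for j in adj[path[-1]]:
            if j not in used:
                found = extend(path + [j], used | {j})
                if found:
                    return found
        return None

    for start in range(n):
        path = extend([start], {start})
        if path:
            # Spell the string: the first k-mer, then the last base of each later k-mer.
            return kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
    return None


# The spectrum of TATGGTGC, deliberately shuffled.
rebuilt = hamiltonian_reconstruct(["GTG", "TAT", "TGC", "GGT", "ATG", "TGG"])
```

For this spectrum the path is unique: TAT has no predecessor and TGC has no successor, so the search is forced to recover TATGGTGC.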

OLC
- Conventional shotgun sequencing
- Overlap-layout-consensus
  - Overlap: use a computer to search for overlaps, trying all possible pairs of fragments
  - Layout: putting fragments together
  - Consensus: error correction
- Good for short prokaryote genomes
- For n fragments, # of possible overlaps is 2n(n-1)
- Difficult
  - No solution for the "repeat problem" of finding the correct path in the layout step
  - Produces sequencing errors
- Programs
  - PHRAP, CAP, TIGR, CELERA

SBH as an Eulerian Path Problem
- Break up reads into short k-mers (k = 27, for example)
- Build a graph whose vertices are all (k-1)-mers: the de Bruijn graph
- Edges correspond to the k-mers from Spectrum(s, k)
- Find a path visiting every edge exactly once

Eulerian Path Problem
- Repeatedly find cycles in the graph and splice them together until every edge is used
- Runs in time linear in the number of edges
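A compact way to see the Eulerian formulation is to build the de Bruijn graph from the k-mers and walk it with Hierholzer's algorithm. The sketch below assumes error-free k-mers, a non-empty input, and a graph that actually has an Eulerian path; the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Vertices are (k-1)-mers; each k-mer adds one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: linear in the number of edges."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the vertex with one more outgoing than incoming edge, if there is one.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg.get(u, 0) == 1),
        next(iter(graph)),
    )
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit vertex while backtracking
    return path[::-1]


def assemble(kmers):
    """Spell the string along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])


assembled = assemble(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
```

The sketch does not verify that every edge was used, so on a disconnected or unbalanced graph it returns only a partial walk; a production assembler would also have to handle repeats, errors, and both strands.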

OLC vs. De Bruijn Graph

Scaffold linkage
- Connecting contigs
- Paired-end read
- Mate-pair

SBH Programs
- ABySS (Assembly By Short Sequences)
  - Simpson, 2009
- Velvet
  - Zerbino and Birney, 2008
- Euler
  - Pevzner, 2001
- SOAPdenovo (Short Oligonucleotide Alignment Program)
  - Beijing Genomics Institute

SBH Schematic Overview

ABySS
- Proceeds in two stages
- Stage 1
  - All possible k-mers are generated from reads
  - Remove read errors and construct initial contigs
- Stage 2
  - Use mate-pairs to extend contigs
- Distributed implementation of the de Bruijn graph on a cluster, using the Message Passing Interface (MPI) over multiple computers

Velvet
- Construct a graph
- Transform reads into roadmaps
  - From a read, generate k-mers with read ID and position in the read (called roadmaps)
  - Each read is transformed to a set of k-mers with overlaps and hash links to previous reads with the same k-mers
- 2nd database
  - For each read, which k-mers are overlapped by subsequent reads
- Trace reads through the graph using roadmaps

EULER, 2001
- Implements the Eulerian Path problem
- Issues with real data
  - Reads may have errors
    - Error correction is typically done in the 'consensus' stage
    - EULER corrects errors in the first step
  - Repeat problem
    - De Bruijn graph

EULER+, 2004
- A-Bruijn graph
  - To handle errors in reads, introduce vertices from ungapped alignments that allow mismatches, rather than the exact l-mers of de Bruijn assembly
  - Graph simplification algorithms to remove errors in edges
- The size of the de Bruijn graph grows with coverage, so short reads, which need higher coverage than long reads, require a large memory

EULER-SR, 2008
- Focus on a memory-efficient algorithm for Short Reads
- Results
  - Error correction
  - E. coli: 68% error-free reads
  - 99.6% of errors are corrected
  - 12 h on a single processor, 1/2 h for assembly

EULER-USR, 2009
- Shows results of EULER-SR on error-prone Illumina read data
- Shows that 35-nt reads are sufficient when mate-pairs are used

Comparisons

OLC vs. de Bruijn Graph

Difficulties in Graph-based Approaches
- De Bruijn Graph (DBG) approach
  - A read of n bp is broken into n-k+1 k-mers
  - A read of 150 bp produces 124 k-mers when k = 27 (150-27+1)
  - New sequencing machines produce longer reads
- Problem sizes
  - OLC
    - For a large genome, more than a billion (giga) reads
  - Eulerian Path
    - k = 27 => (k-1) = 26
    - 4^26 = 2^52 possible (k-1)-mers (for scale, 2^30 = 1 G)
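The problem sizes quoted for both approaches are simple arithmetic and can be reproduced directly. The helper names below are hypothetical; the 2n(n-1) overlap count is the figure given on the OLC slide.

```python
def kmers_per_read(read_length, k):
    """A read of n bases contains n - k + 1 overlapping k-mers."""
    return read_length - k + 1


def de_bruijn_vertex_space(k):
    """Number of distinct (k-1)-mers over a 4-letter alphabet: 4^(k-1)."""
    return 4 ** (k - 1)


def olc_candidate_overlaps(n_fragments):
    """Pairwise overlap tests for n fragments, using the slide's 2n(n-1) count."""
    return 2 * n_fragments * (n_fragments - 1)


per_read = kmers_per_read(150, 27)        # 124
vertex_space = de_bruijn_vertex_space(27)  # 4^26 = 2^52
pair_tests = olc_candidate_overlaps(10 ** 9)
```

With a billion reads the naive pairwise comparison needs on the order of 2 * 10^18 overlap tests, which is the motivation for partitioning the data before comparing.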

New OLC Approach
- Preserve read info
- Data partition
  - Partition data according to front and rear k-mers
  - Utilize known oligomer distribution of species

New OLC Approach
- Naive OLC
  - Pair-wise fragment overlap
- Partial OLC
  - Partition data
  - Create consensus string

New OLC Approach
- Data partition
  - Front (F) buckets of reads with identical first k-mers
  - Rear (R) buckets of reads with identical last k-mers
  - k = 15: 4^15 = 2^30 ≈ 1 billion buckets, times 2 (front and rear)
- Join F and R buckets with the same k-mers
  - Produce consensus contigs with base matches over a threshold
- Repeat, generating new F & R buckets
  - Each time dealing with longer contigs
- Per-bucket sequencing
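The partition-and-join idea can be sketched with two dictionaries keyed by k-mer. This is a simplified illustration under my own assumptions, not the presenter's implementation: it joins reads only on an exact k-base match between one read's end and another's start, and it omits the threshold-based consensus step. All function names are hypothetical.

```python
from collections import defaultdict


def build_buckets(reads, k):
    """Partition read indices by their first k bases (front) and last k bases (rear)."""
    front, rear = defaultdict(list), defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= k:
            front[read[:k]].append(idx)
            rear[read[-k:]].append(idx)
    return front, rear


def join_candidates(front, rear):
    """Pairs (a, b) where the last k bases of read a equal the first k bases of read b."""
    return sorted(
        (a, b)
        for kmer, enders in rear.items()
        for a in enders
        for b in front.get(kmer, [])
        if a != b
    )


def merge(left, right, k):
    """Join two reads that share k bases at the junction."""
    return left + right[k:]


reads = ["TATGGT", "GGTGCA", "GCATTT"]
front, rear = build_buckets(reads, 3)
pairs = join_candidates(front, rear)
contig = merge(reads[0], reads[1], 3)
```

Only reads that land in matching front and rear buckets are ever compared, which is what avoids the all-pairs search of naive OLC. Rebuilding the buckets from the merged contigs gives the repeated rounds described above.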

Benefits
- Hybrid of OLC and fixed-size de Bruijn graph
- The first k or last k bp can be rapidly compared
- Parallelize the program with all possible combinations of s

Implementation
- Dual Index Bucket Map data structure, indexed by prefix and suffix
  - A set of reads
  - A hash table (Java HashMap) mapping prefixes to a Java Vector of references (pointers) to fragments in the set
  - Another hash table mapping suffixes to a vector of references to fragments in the set
  - Bucket size is a parameter

Algorithms
- Focused merge
  - Start with a random fragment (or the largest one) and keep merging from there until no further merge is possible or the resulting contig is at least as large as the original input sequence
- Sweep merge
  - Sweep through each bucket repeatedly, making the best merge each time, until no merges can be done
- Cutoff for the maximum difference between the base ratios of a merge of 2 fragments and the base ratios of the original sequence
- Heuristic "oversize penalty" for merges that are significantly larger than the original sequence
- Heuristics for base ratio difference
  - Average: calculate the base ratios for each sequence and average the differences over the bases
  - Maximum: take the maximum of the ratio differences over the bases
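The two base-ratio heuristics reduce to a per-base absolute difference followed by either a mean or a maximum. The sketch below is one plausible reading of that description; the function names and the `mode` parameter are hypothetical, and sequences are assumed to be non-empty and to contain only A, C, G, and T.

```python
from collections import Counter

BASES = "ACGT"


def base_ratios(seq):
    """Fraction of each base in a non-empty ACGT sequence."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in BASES}


def ratio_difference(seq, reference_ratios, mode="average"):
    """Distance between a candidate merge's base ratios and the reference ratios.

    mode="average": mean of the per-base absolute differences.
    mode="maximum": largest per-base absolute difference.
    """
    ratios = base_ratios(seq)
    diffs = [abs(ratios[b] - reference_ratios[b]) for b in BASES]
    return max(diffs) if mode == "maximum" else sum(diffs) / len(diffs)


uniform = {b: 0.25 for b in BASES}
avg_diff = ratio_difference("AACG", uniform)             # 0.125
max_diff = ratio_difference("AACG", uniform, "maximum")  # 0.25
```

A merge would then be accepted only if the chosen difference stays under the cutoff, with the reference ratios taken from the known base distribution of the species.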

Comparison of OLC algorithms
Yeast genome; artificial data with no read errors

Algorithm | Strategy     | Contigs  | Heuristic      | Time                  | Mixed Chromosome Support | Worst Case Accuracy | Ease of Use
Focused   | Beam Search* | Single   | Base ratios    | 2-15 seconds          | Unreliable/Poor          | Low                 | Not too bad
Sweep     | Wide Search* | Multiple | Base ratios    | 1-30** seconds        | Excellent                | Good                | Needs micromanaging
Naive     | Brute Force  | Single   | Actual overlap | ½ hour to a few hours | None                     | Optimal             | Simple

* with optional simulated annealing.

Summary
- Base and oligomer distributions in species are unique
- New OLC sequencing
  - DBG may not be suitable for longer reads
  - Utilize
    - Data partition
    - Base distribution