CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly

Outline
EULER fragment assembly
Mate-pairs, scaffolding and copy number
Next-generation DNA sequencing
Cancer genome sequencing

Whole Genome Shotgun Sequencing
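The genome is cut many times at random.
Fragments are cloned into plasmids (2 – 10 Kbp) or cosmids (40 Kbp).
Both ends of each clone are sequenced (~500 bp each), giving forward-reverse paired reads (a mate pair) separated by a known distance.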

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors
..ACGATTACAATAGGTT..

Approaches to Fragment Assembly
Find a path visiting every VERTEX exactly once in the OVERLAP graph: Hamiltonian path problem.
NP-complete: no efficient (polynomial-time) algorithm is known.

Approaches to Fragment Assembly (cont’d)
Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem.
Linear-time algorithms are known.

EULER - A New Approach to Fragment Assembly
The traditional “overlap-layout-consensus” technique has a high rate of mis-assembly.
EULER uses the Eulerian path approach borrowed from “sequencing by hybridization” (SBH).
Fragment assembly without repeat masking can be done in linear time with greater accuracy.

Sequencing by Hybridization (SBH)
Build a microarray with all 4^l DNA sequences of length l (l ~ 20).
For a DNA sequence s, measure its l-mer composition.

l-mer composition
Def: Given a string s of length n, Spectrum(s, l) is the unordered multiset of all (n – l + 1) l-mers in s.
The order of individual elements in Spectrum(s, l) does not matter.
For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
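
The definition above can be sketched in a few lines of Python. The function name `spectrum` and the use of a `Counter` to stand in for the multiset are choices made for this illustration, not something taken from the lecture.

```python
from collections import Counter

def spectrum(s, l):
    """Return Spectrum(s, l): the multiset of all (n - l + 1) l-mers of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)
# Sorting is only for display; the multiset itself has no order.
print(sorted(spec.elements()))
```

A `Counter` keeps multiplicities, which matters once s contains a repeated l-mer.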

The SBH Problem
Goal: Reconstruct a string from its l-mer composition
Input: A multiset S, representing all l-mers from an (unknown) string s
Output: String s such that Spectrum(s, l) = S

SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Vertices correspond to (l – 1)-mers: {AT, TG, GC, GG, GT, CA, CG}
Edges correspond to l-mers from S
[Figure: de Bruijn graph of S on vertices AT, GT, CG, CA, GC, TG, GG, with a path that visits every EDGE once.]
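
As a rough sketch of the construction on this slide, the de Bruijn graph can be built by turning each l-mer into an edge from its (l – 1)-mer prefix to its (l – 1)-mer suffix. The adjacency-list representation and the name `de_bruijn` are assumptions made for this example.

```python
from collections import defaultdict

def de_bruijn(lmers):
    """Map each (l-1)-mer prefix to the list of (l-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])   # one edge per l-mer
    return dict(graph)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(S)
# Vertices are every prefix or suffix that occurs in some l-mer.
vertices = set(graph) | {v for targets in graph.values() for v in targets}
print(sorted(vertices))
print(graph)
```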

SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Two different paths give different sequence reconstructions: ATGGCGTGCA and ATGCGTGGCA
[Figure: the two Eulerian paths drawn on the de Bruijn graph of S.]
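
To check that this S really admits exactly these two reconstructions, one can enumerate them exhaustively. The sketch below is a brute-force backtracking search written for this tiny example (it assumes l ≥ 2 and is exponential in general); it is not the linear-time Eulerian path algorithm, and the name `reconstructions` is made up here.

```python
def reconstructions(lmers):
    """Return every string whose multiset of l-mers equals lmers (exhaustive search)."""
    l = len(lmers[0])
    results = set()

    def extend(text, unused):
        if not unused:
            results.add(text)               # every l-mer (edge) has been used once
            return
        suffix = text[-(l - 1):]
        for lmer in set(unused):
            if lmer.startswith(suffix):
                remaining = list(unused)
                remaining.remove(lmer)      # use up one copy of this l-mer
                extend(text + lmer[-1], remaining)

    for lmer in set(lmers):                 # try every l-mer as the first edge
        remaining = list(lmers)
        remaining.remove(lmer)
        extend(lmer, remaining)
    return results

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(reconstructions(S)))
```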

Euler Theorem
A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v).
Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
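
The balance condition is easy to test directly. The following is a minimal sketch that assumes the graph is given as a list of directed (u, v) edges; the function names are illustrative only.

```python
from collections import Counter

def imbalance(edges):
    """Return {v: in(v) - out(v)} for a list of directed (u, v) edges."""
    d = Counter()
    for u, v in edges:
        d[u] -= 1   # edge leaves u
        d[v] += 1   # edge enters v
    return dict(d)

def is_balanced(edges):
    """True if in(v) = out(v) for every vertex."""
    return all(x == 0 for x in imbalance(edges).values())

cycle_edges = [("A", "B"), ("B", "C"), ("C", "A")]
path_edges = [("A", "B"), ("B", "C")]
print(is_balanced(cycle_edges), is_balanced(path_edges))
```

Note that this only checks balance; the theorem also requires the graph to be connected.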

Euler Theorem: Proof
Eulerian → balanced: in an Eulerian cycle, every edge entering v (incoming edge) is followed by an edge leaving v (outgoing edge). Therefore in(v) = out(v).
Balanced → Eulerian: ???

Algorithm for Constructing an Eulerian Cycle
a. Start at an arbitrary vertex v and walk along unused edges until a dead end is reached. Since the graph is balanced, this dead end is necessarily the starting point, i.e., vertex v, so the walk forms a cycle.

Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.

Algorithm for Constructing an Eulerian Cycle (cont’d)
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
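
Steps (a) through (c) can be sketched with a single stack, a common way of implementing this cycle-merging procedure (usually attributed to Hierholzer). The adjacency-list input format and the function name are assumptions for this example, and the input is assumed to be connected and balanced.

```python
def eulerian_cycle(graph, start):
    """Return an Eulerian cycle as a vertex list.

    graph maps each vertex to a list of successors and is assumed to be
    connected and balanced; it is copied, not modified.
    """
    unused = {v: list(targets) for v, targets in graph.items()}
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            # Keep walking along unused edges (steps a and b).
            stack.append(unused[v].pop())
        else:
            # No unused edges leave v: emit it, which splices the
            # sub-cycles together as the stack unwinds (step c).
            cycle.append(stack.pop())
    return cycle[::-1]

graph = {"A": ["B"], "B": ["C", "D"], "C": ["A"], "D": ["B"]}
print(eulerian_cycle(graph, "A"))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.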

Overlap Graph: Hamiltonian Approach
Each vertex represents a read from the original sequence.
Vertices from repeats are connected to many others.
Find a path visiting every VERTEX exactly once: Hamiltonian path problem
[Figure: overlap graph with the repeat region labeled.]

Overlap Graph: Eulerian Approach
Placing each repeat edge together gives a clear progression of the path through the entire sequence.
Find a path visiting every EDGE exactly once: Eulerian path problem
[Figure: repeat graph with the repeat edge labeled.]

Multiple Repeats
The repeat graph can be easily constructed with any number of repeats.

Repeat Graph
(a) DNA sequence with a triple repeat R;
(b) the layout graph;
(c) construction of the de Bruijn graph by gluing repeats;
(d) de Bruijn graph.
Pevzner P. A. et al. PNAS 2001;98:

Building Repeat Graph
Problem: Construct the repeat graph from a collection of reads.
Solution: Break the reads into smaller pieces.

Building Repeat Graph
Reads are obtained from the original sequence at lengths that can be sequenced with a high level of certainty.
They are then broken again into k-mers.

EULER Fragment Assembly Approach
Input: Reads s1, …, sN
Further subdivide the reads into k-mers (k = 20).
Build the repeat graph on the resulting k-mers.
Each read is a path in the resulting graph.
Solve the Eulerian Superpath Problem: given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

Repeat Graph
Vertices correspond to (k – 1)-mers in each read.
Edges correspond to k-mers in each read.
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
[Figure: repeat graph on vertices AT, GT, CG, CA, GC, TG, GG.]
Two Eulerian paths (visit every EDGE once): ATGCGTGGCA and ATGGCGTGCA
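
The decomposition of reads into k-mers in this example can be reproduced as follows. This is only a sketch with k = 3 to match the slide (the lecture uses k = 20 for real data), and the helper name `read_kmers` is made up for illustration.

```python
def read_kmers(read, k):
    """List the k-mers of a read in order; consecutive k-mers form a path in the graph."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

reads = ["ATGGC", "GGCGTG", "GTGCA"]
paths = {read: read_kmers(read, 3) for read in reads}
# Edges of the repeat graph: the distinct k-mers over all reads.
edges = {kmer for path in paths.values() for kmer in path}
print(paths)
print(sorted(edges))
```

k-mers shared between reads (here GGC and GTG) are glued into a single edge, which is what lets overlapping reads join up in the graph.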

Reads in Repeat Graph
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Eulerian superpath: an Eulerian path that contains a set of paths (reads) as subpaths.
[Figure: the reads drawn as paths in the repeat graph on vertices AT, GT, CG, CA, GC, TG, GG.]
Candidate Eulerian paths: ATGCGTGGCA and ATGGCGTGCA
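
For this small example, the superpath condition can be checked by brute force: a read is a subpath of an Eulerian path exactly when the read occurs as a substring of the spelled-out sequence. The sketch below simply filters the two candidates; it is a simplification for illustration and not the graph transformation procedure used by EULER itself.

```python
def consistent_with_reads(candidates, reads):
    """Keep the candidate reconstructions that contain every read as a substring."""
    return [s for s in candidates if all(read in s for read in reads)]

candidates = ["ATGCGTGGCA", "ATGGCGTGCA"]
reads = ["ATGGC", "GGCGTG", "GTGCA"]
print(consistent_with_reads(candidates, reads))
```

Only ATGGCGTGCA survives, because ATGCGTGGCA does not contain the read ATGGC.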

Additional challenges in EULER Approach
Errors in reads
Reverse complement of DNA string
Using mate-pair information to simplify the repeat graph
Multiplicities of edges generally unknown (copy number problem)

Sequencing Errors
If a read contains an error, the error is carried into every 20-mer broken from that read that covers the erroneous position.

Sequencing Errors
However, the error will not be present in the other instances of that 20-mer, which come from other reads covering the same region.
“Consensus first” approach
Let T = {all l-tuples appearing in > M reads}.
A string s is called a T-string if all its l-tuples belong to T.
Spectral Alignment Problem. Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.
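
A brute-force sketch of the Spectral Alignment Problem for very short strings is shown below. It assumes that "mutations" means substitutions only, takes T as an already-computed set (the threshold M is not modeled), and caps the search with a hypothetical `max_mutations` parameter; practical error correction uses far more efficient methods.

```python
from itertools import combinations, product

def is_t_string(s, T, l):
    """True if every l-tuple of s belongs to T."""
    return all(s[i:i + l] in T for i in range(len(s) - l + 1))

def spectral_alignment(s, T, l, max_mutations=2):
    """Return (count, corrected) using the fewest substitutions, or None if none is found."""
    for count in range(max_mutations + 1):          # try fewer mutations first
        for positions in combinations(range(len(s)), count):
            for letters in product("ACGT", repeat=count):
                chars = list(s)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                candidate = "".join(chars)
                if is_t_string(candidate, T, l):
                    return count, candidate
    return None

T = {"ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"}
# Read with one substitution error (G -> T at index 5) relative to ATGGCGTGCA.
print(spectral_alignment("ATGGCTTGCA", T, 3))
```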

Sequencing Errors
Solving the Spectral Alignment Problem attempts to eliminate most point mutation errors before reconstructing the original sequence.
Not perfect!

Forward and Reverse Complements
We obtain reads from both strands of DNA (one strand runs 5’ to 3’, the other 3’ to 5’).
We do not know the strand of origin.
s = CAGT
s’ = ACTG (reverse complement)
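
The reverse complement in this example can be computed as in the short sketch below, which assumes reads contain only the letters A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement each base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]

print(reverse_complement("CAGT"))
```

Applying the function twice returns the original string, which is why a read and its reverse complement can be treated as two views of the same fragment.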

Forward and Reverse Complements
In the EULER assembler, include the reverse complement of each read.
“assume that S contains a complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement)”
Alternative approaches use bidirected graphs.

Using Mate-Pair Information
Repeats and other ambiguities lead to tangles in the repeat graph.
[Figure: a repeat v1 … vn and a system of paths overlapping with this repeat; paths 1 and 2 enter the repeat, paths 3 and 4 leave it.]
Which pairing is correct: 1 → 3 and 2 → 4, OR 1 → 4 and 2 → 3?

A mate-pair (r1, r2) gives a pair of positions in G.
Find a path P in G from r1 to r2.
Let l(r1, r2) be the length of the mate pair and d(r1, r2) the length of the path P between r1 and r2 in G.
If there is a unique path P with d(r1, r2) ≈ l(r1, r2), then use P as a “long read” in the superpath algorithm.
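
One way to picture the mate-pair test is a bounded depth-first search that lists every path from r1 to r2 whose length is close to the mate-pair length. Everything in this sketch is an assumption for illustration: the graph encoding (vertex to list of (successor, edge length) pairs with positive lengths), the tolerance `tol` standing in for "≈", and the toy numbers.

```python
def paths_within(graph, src, dst, target, tol):
    """List all paths src -> dst whose total edge length is within tol of target.

    graph maps a vertex to a list of (successor, edge_length) pairs;
    edge lengths must be positive so the search terminates.
    """
    found = []

    def walk(v, path, dist):
        if dist > target + tol:
            return                          # already too long; prune this branch
        if v == dst and abs(dist - target) <= tol:
            found.append(path)
        for w, length in graph.get(v, []):
            walk(w, path + [w], dist + length)

    walk(src, [src], 0)
    return found

# Toy tangle: two routes from r1 to r2, of length 1000 and 2500.
graph = {
    "r1": [("a", 100)],
    "a": [("b", 500), ("c", 2000)],
    "b": [("r2", 400)],
    "c": [("r2", 400)],
}
paths = paths_within(graph, "r1", "r2", target=1000, tol=100)
print(paths)
```

If exactly one path is returned, it can play the role of the "long read"; if several fit, the mate pair does not resolve the tangle.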

Scaffolding

Copy number problem
Let d(v) = indegree – outdegree.
Balanced graph: d(v) = 0 for all v.
Goal: Introduce multiplicities on edges so that the graph is balanced.

Copy number problem
Goal: Introduce multiplicities on edges so that the graph is balanced. Use as few extra edges as possible.
Balance each vertex by adding edge multiplicities.
Assign flow f(e) to each edge such that d(v) = 0 for all vertices.

Copy number problem
Let d(v) = indegree – outdegree.
Balanced graph: d(v) = 0 for all v.
Graph G = (V, E, w). Weights w(e) = 1 for all e.
Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e.

Copy number problem
Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e.
Min-flow Max-cut Theorem: For a directed acyclic graph G = (V, E, w) with lower capacity bounds: min flow from v to w = capacity of the maximum cut separating v from w

Copy number problem
Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e.
Min-cost circulation (see Myers 2005): Assign cost c(e) = 1 to each edge.
Minimize Σ c(e) f(e) such that f(e) ≥ w(e) for all e and d(v) = 0 for all vertices.
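
To make the objective concrete, the sketch below solves a toy instance by exhaustive search: it tries every assignment of multiplicities f(e) ≥ 1 up to a small cap and keeps the balanced one with the smallest total. This is not a min-cost circulation solver. The cap `max_copies`, the toy graph, and the closing edge from t back to s (added so that a circulation exists) are all assumptions made for this illustration.

```python
from itertools import product

def min_multiplicities(edges, max_copies=3):
    """Brute-force the smallest total multiplicity with f(e) >= 1 and d(v) = 0 everywhere."""
    best = None
    for f in product(range(1, max_copies + 1), repeat=len(edges)):
        d = {}
        for (u, v), copies in zip(edges, f):
            d[u] = d.get(u, 0) - copies     # copies leaving u
            d[v] = d.get(v, 0) + copies     # copies entering v
        if all(x == 0 for x in d.values()) and (best is None or sum(f) < sum(best)):
            best = f
    return None if best is None else dict(zip(edges, best))

# Toy repeat graph for a genome of the form x R y R z:
# s -x-> a -R-> b -z-> t, with b -y-> a looping back through the repeat,
# plus an assumed closing edge t -> s.
edges = [("s", "a"), ("a", "b"), ("b", "a"), ("b", "t"), ("t", "s")]
result = min_multiplicities(edges)
print(result)
```

The repeat edge (a, b) comes out with multiplicity 2 and every other edge with multiplicity 1, matching the two copies of R.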

Next-generation sequence platforms
454
Illumina
ABI SOLiD (solid.appliedbiosystems.com)
There is current interest in the fragment assembly problem for these technologies.

Polony Sequencing
"Polonies" are tiny colonies of DNA, about one micron in diameter, grown on a glass microscope slide (the word itself is a contraction of "polymerase colony"). To create them, researchers first pour a solution containing chopped-up DNA onto the slide. Adding an enzyme called polymerase causes each piece to copy itself repeatedly, creating millions of polonies, each dot containing only copies of the original piece of DNA. The polonies are then exposed to a series of chemically-labeled probes that light up when run through a scanning machine, identifying each nucleotide base in the strand of code, much as dusting with powder allows crime-scene investigators to bring up fingerprints on a surface.
Prior to sequencing, dsDNA is denatured and unbound copy strands are washed away. Covalently linked template strands allow for washing.

Polony sequencing—Assembly
Resulting reads are likely to look different from Sanger reads:
Short (currently 100 to 200 bp)
Low error rates, except in homopolymeric runs (AAA…, CCC…, etc.)
Currently, it is not known how to do paired reads on a chip. Maybe very soon!

454 Sequencing

Illumina Sequencing

Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
Figure 1. A nanopore sensor for sequencing DNA. A channel or nanopore in an insulating membrane separates two ionic solution-filled compartments. In response to a voltage bias (labeled “-” and “+”) across the membrane, ssDNA molecules (yellow) in the “-” compartment are driven, one at a time, into and through the channel. Embedded in the membrane, an electrically connected nanotube (orange) that abuts on the nanopore serves as a sensor to identify the nucleotides in the translocating DNA molecules. Elevated temperatures and denaturants maintain the DNA in an unstructured, single-stranded form.
The underlying principle of nanopore sequencing is that a single-stranded DNA or RNA molecule can be electrophoretically driven through a nano-scale pore in such a way that the molecule traverses the pore in strict linear sequence, as illustrated in Figure 1. Because a translocating molecule partially obstructs or blocks the nanopore, it alters the pore's electrical properties [1].

Nanopore Sequencing—Assembly
Resulting reads are likely to look different from Sanger reads:
Long (perhaps 10,000 bp – 1,000,000 bp)
High error rate (perhaps 10% – 30%)
Two colors? A/CTG, AT/CG, AG/CT
How can we assemble under such conditions?

Some future directions for sequencing
1. Personalized genome sequencing
Find your ~1,000,000 single nucleotide polymorphisms (SNPs)
Find your rearrangements
Goals:
Link genome with phenotype
Provide personalized diet and medicine
(???) designer babies, big-brother insurance companies
Timeline:
Inexpensive sequencing:
Genotype–phenotype association: 2010-???
Personalized drugs: ???

2. Environmental sequencing
Find your flora: organisms living in your body
External organs: skin, mucous membranes
Gut, mouth, etc.
Normal flora: >200 species, >trillions of individuals
Flora–disease, flora–non-optimal health associations
Timeline:
Inexpensive research sequencing: today
Research & associations within next 10 years
Personalized sequencing
Find diversity of organisms living in different environments
Hard to isolate
Assembly of all organisms at once

3. Organism sequencing
Sequence a large fraction of all organisms
Deduce ancestors
Reconstruct ancestral genomes
Synthesize ancestral genomes
Clone (Jurassic Park!)
Study evolution of function
Find functional elements within a genome
How those evolved in different organisms
Find how modules/machines composed of many genes evolved

DNA Sequencing – Recap
Timeline from 1975 to 2015:
Gel electrophoresis: predominant, old technology by F. Sanger
Whole genome strategies: physical mapping, walking, shotgun sequencing
Computational fragment assembly
The future: new sequencing technologies (pyrosequencing, single molecule methods, …)
Assembly techniques
Future variants of sequencing: resequencing of humans, cancer genome sequencing, microbial and environmental sequencing

Cell Division and Mutation
Somatic mutations that occur during cell division are a major contributor to the development of cancer.
Types of change: single nucleotide change, copy number, structural.
We will focus on structural changes and later on copy number, which is not to say that single nucleotide changes are less important.
What is the effect of structural changes?

Rearrangements in Cancer
1) Change gene structure, create novel fusion genes. Example: Gleevec targets the BCR-ABL fusion.
2) Alter gene regulation. Example: Burkitt’s lymphoma.
IMAGE CREDIT: Gregory Schuler, NCBI, NIH, Bethesda, MD

Cancer Genomes
Classically, rearrangements were discovered through cytogenetic techniques like chromosome painting.
These techniques are more complicated in solid tumors and don’t reveal the detailed architecture or the genes affected.
Rearrangements were thought to be mostly noise.
Recent identification of a new fusion gene: fusion gene in >50% of prostate cancer patients (Tomlins et al., Science, Oct. 2005).
Can we use genome sequencing?

End Sequence Profiling (ESP)
C. Collins and S. Volik (2003)
Pieces of cancer genome: clones ( kb).
Sequence ends of clones (500 bp).
Map end sequences to human genome. Because of the end-sequencing protocol, clones have direction.
Each clone corresponds to a pair of end sequences (ES pair) (x, y).
Retain clones that correspond to a unique ES pair.
[Figure: a clone from the cancer DNA with its end sequences mapped to positions x and y in the human DNA.]

Valid ES pairs:
Lmin ≤ y – x ≤ Lmax, where Lmin (Lmax) is the minimum (maximum) size of a clone.
Convergent orientation.
[Figure: a clone of length L from the cancer DNA whose end sequences map to x and y in the human DNA.]

Some pairs cannot be mapped due to repeats in the human genome.
Invalid ES pairs:
Putative rearrangement in cancer.
ES directions point toward breakpoints (a, b): Lmin ≤ |x – a| + |y – b| ≤ Lmax
[Figure: end sequences x and y mapped to the human DNA, with breakpoints a and b lying between them.]
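
The two inequalities for ES pairs translate directly into predicates. In the sketch below, encoding strands as "+" and "-" (with "+" then "-" meaning convergent) is an assumption, and the clone size bounds are made-up numbers for illustration, not values from the lecture.

```python
def is_valid_es_pair(x, y, x_strand, y_strand, l_min, l_max):
    """Valid pair: convergent orientation and Lmin <= y - x <= Lmax."""
    convergent = x_strand == "+" and y_strand == "-"
    return convergent and l_min <= y - x <= l_max

def breakpoint_explains(x, y, a, b, l_min, l_max):
    """Breakpoint (a, b) is compatible with an invalid pair (x, y) if
    Lmin <= |x - a| + |y - b| <= Lmax."""
    return l_min <= abs(x - a) + abs(y - b) <= l_max

L_MIN, L_MAX = 100_000, 250_000   # hypothetical clone size bounds

print(is_valid_es_pair(1_000_000, 1_150_000, "+", "-", L_MIN, L_MAX))
print(is_valid_es_pair(1_000_000, 5_000_000, "+", "-", L_MIN, L_MAX))
print(breakpoint_explains(1_000_000, 5_000_000, 1_080_000, 4_930_000, L_MIN, L_MAX))
```

The second pair spans far more than Lmax on the human genome, so it is invalid; the third call checks that a breakpoint pair about 150 kb away in total would account for it.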

Sources Serafim Batzoglou (Sequencing slides) (Euler slides)
