1
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
2
DNA sequencing How we obtain the sequence of nucleotides of a species
5’ 3’ …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… 3’ 5’
3
Outline
- DNA Sequencing Technology
- Fragment Assembly Problem
- Overlap-Layout-Consensus
- Sequencing by Hybridization
- Eulerian and Hamiltonian Graphs
- Next-generation DNA Sequencing
4
DNA Replication
5
DNA Sequencing – gel electrophoresis
- Start at a fixed location (primer)
- Grow the DNA chain
- Include dideoxynucleosides (modified A, C, G, T)
- The reaction stops at all possible points
- Separate the products by length, using gel electrophoresis
6
Technical Limitations
- Need a lot of DNA
- Reaction only works for 500 bp
Solutions come from both biology and computer science
7
DNA Sequencing – Cloning
[Figure labels: shake DNA into fragments; fragment + vector (a circular genome: bacterium, plasmid; known location: restriction site) = many host cells; DNA amplification]
8
Different types of vectors
Vector                                   Size of insert (bp)
Plasmid                                  2,000-10,000 (can control the size)
Cosmid                                   40,000
BAC (Bacterial Artificial Chromosome)    70, ,000
YAC (Yeast Artificial Chromosome)        > 300,000 (not used much recently)
9
DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA
Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~500 letters at a time
10
Electrophoresis diagrams
11
Challenging to read answer
12
Challenging to read answer
13
Challenging to read answer
14
Reading an electropherogram
- Filtering
- Smoothing
- Correction for length compressions
- A method for calling the letters: PHRED (PHil's Read EDitor, by Phil Green)
Several better methods exist, but labs are reluctant to change
15
Output of PHRED: a read
A read: 500-700 nucleotides, e.g. A C G A A T C A G … A, each base with a quality score (… 21)
Quality scores: $Q = -10 \log_{10} P(\text{error})$
Reads can be obtained from the leftmost and rightmost ends of the insert
Double-barreled sequencing (1990): both the leftmost and rightmost ends are sequenced, and the reads are paired
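The quality-score formula above can be checked numerically. The following is a minimal sketch; the function names `phred_quality` and `error_probability` are illustrative and not part of PHRED itself.

```python
import math

def phred_quality(p_error: float) -> float:
    """Quality score Q = -10 * log10(P(error)) for one base call."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Inverse mapping: P(error) = 10^(-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# A base with a 1% chance of being wrong has quality 20;
# quality 30 corresponds to a 0.1% error probability.
print(phred_quality(0.01), error_probability(30))
```

Under this definition, every 10 points of quality correspond to a tenfold drop in the error probability.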
16
Method to sequence longer regions
[Figure: a genomic segment is cut many times at random (shotgun)]
Get one or two reads (~500 bp each) from each segment
17
Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random into plasmid (2-10 Kbp) and cosmid (40 Kbp) inserts; forward-reverse paired reads (mate pairs) of ~500 bp each are taken from the two ends of an insert, a known distance apart]
18
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
$3 \times 10^9$ nucleotides
19
Fragment Assembly Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)
20
Shortest Common Superstring Problem (SCS)
Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings $s_1, s_2, \ldots, s_n$
Output: A string $s$ that contains all strings $s_1, s_2, \ldots, s_n$ as substrings, such that the length of $s$ is minimized
Note: this formulation does not take into account sequencing errors
21
Shortest Common Superstring Problem: Example
22
Greedy Algorithm for SCS
1. Find the pair of strings $s_i, s_j$ with the longest overlap
2. Merge($s_i, s_j$)
3. Recurse
How “good” is the greedy algorithm?
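The greedy steps can be sketched as follows. This is a minimal illustration, not any assembler's implementation: the helper names `overlap` and `greedy_scs` are made up here, ties between equally long overlaps are broken by scan order, and reads are assumed to be error-free.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    """Greedy superstring: repeatedly merge the pair with the longest overlap."""
    reads = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    # Drop reads that are substrings of another read.
    reads = [s for s in reads if not any(s != t and s in t for t in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""

print(greedy_scs(["ATC", "CCA", "CAG", "TCC", "AGT"]))  # ATCCAGT
```

The result is always a common superstring, but it is not guaranteed to be a shortest one; this sketch also recomputes all pairwise overlaps in every round, which is far too slow for real read sets.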
23
Algorithm Analysis
$S = \{s_1, s_2, \ldots, s_n\}$, Opt($S$) = length of SCS
Claim: Greedy solution ≤ 4 Opt($S$)
Conjecture: Greedy solution ≤ 2 Opt($S$) [best known bound: 2.5]
Theorem: SCS is NP-complete
24
SCS and Overlap Graph
Build a directed graph $G = (V, E)$:
- $V = \{s_1, s_2, \ldots, s_n\}$
- $e = (s_i, s_j) \in E$ if a prefix of $s_j$ matches a suffix of $s_i$
- $w(s_i, s_j) = -(\text{length of the overlap between } s_i \text{ and } s_j)$
Goal: find a minimum-weight path visiting every VERTEX exactly once in the OVERLAP graph: the Travelling Salesman (Hamiltonian path) problem
25
SCS to TSP: An Example
S = { ATC, CCA, CAG, TCC, AGT }
SCS: ATCCAGT
TSP: [Figure: overlap graph on ATC, TCC, CCA, CAG, AGT with edges labeled by overlap length (1 or 2); the path ATC → TCC → CCA → CAG → AGT spells ATCCAGT]
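For a set this small, the Hamiltonian-path formulation can be solved by brute force. The sketch below tries every ordering of the strings and keeps the shortest merged result; it assumes no string is a substring of another, and the function names are illustrative. Its factorial running time is exactly why this formulation does not scale.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def scs_by_hamiltonian_path(strings):
    """Try every vertex ordering (Hamiltonian path) in the overlap graph."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Maximizing total overlap is the same as minimizing total length.
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

print(scs_by_hamiltonian_path(["ATC", "CCA", "CAG", "TCC", "AGT"]))  # ATCCAGT
```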
26
We need to use a linear-time algorithm
Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm
27
Strategies for whole-genome sequencing
1. Hierarchical (clone-by-clone)
   - Break the genome into many long pieces
   - Map each long piece onto the genome
   - Sequence each piece with shotgun
   Example: Yeast, Worm, Human, Rat
2. Online version of (1) (walking)
   - Start sequencing each piece with shotgun
   - Construct the map as you go
   Example: Rice genome
3. Whole genome shotgun
   - One large shotgun pass on the whole genome
   Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog
Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem
28
Reconstructing the Sequence (Fragment Assembly)
reads Cover region with ~7-fold redundancy (7X) Overlap reads and extend to reconstruct the original genomic region
29
Definition of Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: Coverage c = N L / G
How much coverage is enough?
Lander-Waterman model: assuming a uniform distribution of reads, c = 10 results in 1 gapped region per 1,000,000 nucleotides
30
Lander-Waterman Statistics
Given: N reads of length L from a genome of size G
$P(\zeta \text{ covered by a read}) = 1 - (1 - L/G)^N \approx 1 - e^{-c}$, where $c = NL/G$ is the coverage

c    P(covered)
1    0.632
2    0.864
4    0.982
8    0.999
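The coverage formula and its exponential approximation are easy to evaluate. This is a small sketch with illustrative function names; the read count and genome size in the example are made-up numbers chosen to give c = 1.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage c = N * L / G."""
    return n_reads * read_len / genome_len

def prob_covered_exact(n_reads: int, read_len: int, genome_len: int) -> float:
    """P(position covered) = 1 - (1 - L/G)^N."""
    return 1.0 - (1.0 - read_len / genome_len) ** n_reads

def prob_covered_approx(c: float) -> float:
    """Approximation 1 - e^(-c), good when L is much smaller than G."""
    return 1.0 - math.exp(-c)

# Hypothetical example: 20,000 reads of 500 bp from a 10 Mbp genome (c = 1).
c = coverage(20_000, 500, 10_000_000)
print(c, prob_covered_exact(20_000, 500, 10_000_000), prob_covered_approx(c))
for c in (1, 2, 4, 8):
    print(c, prob_covered_approx(c))
```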
31
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
32
Challenges in Fragment Assembly
Repeats: a major problem for fragment assembly
More than 50% of the human genome consists of repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
[Figure: a repeat; green and yellow fragments are interchangeable when assembling repetitive DNA]
33
Repeat Types
Repeat content: bacterial genomes 5%, mammals 50%
Repeat types:
- Low-complexity DNA (e.g. ATATATATACATA…)
- Microsatellite repeats $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
- Transposons
  - SINE (Short Interspersed Nuclear Elements), e.g. ALU: ~300 bp long, $10^6$ copies
  - LINE (Long Interspersed Nuclear Elements): ~4000 bp long, 200,000 copies
  - LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV
- Gene families: genes duplicate and then diverge (paralogs)
- Recent duplications: ~100,000 bp long, very similar copies
34
Triazzle: A Fun Example
The puzzle looks simple BUT there are repeats!!! The repeats make it very difficult. Try it – only $7.99 at
35
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
$3 \times 10^9$ nucleotides
50% of human DNA is composed of repeats
[Figure: Error! Two distant regions glued together]
36
What can we do about repeats?
Two main approaches: Cluster the reads Link the reads
37
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
$3 \times 10^9$ nucleotides
[Figure: repeat R flanked by A and B in one copy and by C and D in the other]
ARB, CRD or ARD, CRB?
38
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
39
Overlap
- Find the best match between the suffix of one read and the prefix of another
- Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment: $O(N^2 L^2)$ for N reads of length L
- Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
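A suffix-prefix (overlap) alignment can be computed with a standard dynamic program. The sketch below is one common formulation, with an assumed scoring scheme (match +1, mismatch -1, gap -1) that the slides do not specify; it takes O(L^2) time per pair of reads, which is where the O(N^2 L^2) total comes from.

```python
def overlap_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best score of aligning some suffix of a against some prefix of b."""
    # prev[j]: best score for a suffix of a[:i] aligned with b[:j]; this is row i = 0,
    # where the j letters of b can only be aligned against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0]  # the overlap may start at any position of a at no cost
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # The alignment must reach the end of a, but may stop anywhere in b.
    return max(prev)

print(overlap_alignment_score("TAGATTACA", "TTACAGGG"))   # 5: exact overlap TTACA
print(overlap_alignment_score("GGGACGTAC", "ACCTACTTT"))  # 4: ACGTAC vs ACCTAC, one mismatch
```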
40
Overlapping Reads
- Sort all k-mers in reads (k ~ 24)
- Find pairs of reads sharing a k-mer via hashing
- Extend to full alignment; throw away if not >95% similar
[Figure: two reads aligned over the shared region TAGATTACACAGATTAC, with mismatching ends]
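The k-mer filtration step can be sketched with a hash table from k-mers to read identifiers. This is a minimal illustration with a hypothetical function name; a real assembler would also record k-mer positions and consider reverse complements.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices (i, j), i < j, that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> set of read indices containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["TAGATTACAC", "TTACACAGAT", "GGGGGGGGGG"]
print(candidate_pairs(reads, 5))  # {(0, 1)}
```

Only the candidate pairs returned here would go on to the full (and much more expensive) overlap alignment.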
41
Overlapping Reads and Repeats
- A k-mer that appears M times initiates $M^2$ comparisons
- For an Alu that appears $10^6$ times: $10^{12}$ comparisons, too much
- Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
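The frequency filter can be sketched by counting k-mers across all reads and flagging those above a threshold. The function name and the toy reads are illustrative; in the setting above, `max_count` would be set to t × Coverage.

```python
from collections import Counter

def frequent_kmers(reads, k, max_count):
    """Return the k-mers that occur more than max_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c > max_count}

reads = ["ACGTACGT", "ACGTTTTT", "GGACGTGG"]
# ACGT occurs 4 times in total, every other 4-mer at most twice.
print(frequent_kmers(reads, 4, max_count=3))  # {'ACGT'}
```

K-mers in the returned set would be skipped when seeding overlaps, so a repeat that occurs M times no longer triggers on the order of M^2 comparisons.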
42
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
[Figure: stack of overlapping reads, most reading TAGATTACACAGATTACTGA, with a few reads carrying errors (TAG TTACACAGATTATTGA)]
43
Find Overlapping Reads
Correct errors using multiple alignment
[Figure: multiple alignment of reads TAGATTACACAGATTACTGA in which a few reads read TAG-TTACACAGATTATTGA. Isolated errors are corrected (insert A, replace T with C). Correlated errors are probably caused by repeats, so those overlaps are disentangled.]
In practice, error correction removes up to 98% of the errors
44
Layout Repeats are a major challenge
- Do two aligned fragments really overlap, or are they from two copies of a repeat?
- Solution: repeat masking (hide the repeats!)
- Masking results in a high rate of misassembly (up to 20%)
- Misassembly means a lot more work at the finishing step
45
Merge Reads into Contigs
[Figure: reads across a repeat region, forming a unique contig and an overcollapsed contig]
We want to merge reads up to potential repeat boundaries
46
Repeats, errors, and contig lengths
- Repeats shorter than the read length are OK: a read that spans a repeat disambiguates the order of the flanking regions
- Repeats with more base-pair differences than the sequencing error rate are OK: we throw out overlaps between two reads in different copies of the repeat
- To make the genome appear less repetitive, try to:
  - Increase read length
  - Decrease sequencing error rate
Role of error correction: discarding up to 98% of single-letter sequencing errors decreases the error rate, which decreases the effective repeat content, which increases contig length
47
Merge Reads into Contigs (cont’d)
[Figure: reads around a repeat region]
Ignore non-maximal reads
Merge only maximal reads into contigs
48
Merge Reads into Contigs (cont’d)
[Figure: reads a and b; a "hanging" read caused by a sequencing error can look like a repeat boundary]
Ignore “hanging” reads when detecting repeat boundaries
49
Merge Reads into Contigs (cont’d)
[Figure: an ambiguous placement (?????) next to an unambiguous one]
Insert non-maximal reads whenever unambiguous
50
Removing Repeats A-statistic (Myers) used in Celera assembler
Use Lander-Waterman statistics to estimate the likelihood ratio of a unique region vs. an over-collapsed repeat
[Figure: a contig with normal read density next to a too-dense, overcollapsed contig]
51
Link Contigs into Supercontigs
[Figure: contigs with normal read density vs. too-dense (overcollapsed) contigs; inconsistent links: overcollapsed?]
Scaffolding Problem: various heuristic approaches
52
Link Contigs into Supercontigs (cont’d)
- Find all links between unique contigs
- Connect contigs incrementally, if ≥ 2 links
The result is a supercontig (aka scaffold)
53
Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs Use length/distance constraints
54
Link Contigs into Supercontigs (cont’d)
Define $G = (V, E)$:
- V := contigs
- E := (A, B) such that d(A, B) < C
Reason to do so: efficiency; full shortest paths cannot be computed
[Figure: Contig A and Contig B separated by distance d(A, B)]
55
Link Contigs into Supercontigs (cont’d)
Define T: contigs linked to either A or B
Fill the gap between A and B if there is a path in G passing only through contigs in T
[Figure: Contig A and Contig B with the gap between them]
56
Consensus
- A consensus sequence is derived from a profile of the assembled fragments
- A sufficient number of reads is required to ensure a statistically significant consensus
- Reading errors are corrected
57
Derive Consensus Sequence
[Figure: multiple alignment of overlapping reads, e.g. TAGATTACACAGATTACTGACTTGATGGCGTAAACTA, some with gaps or substitutions]
- Derive the multiple alignment from pairwise read alignments
- Derive each consensus base by weighted voting (alternative: take the maximum-quality letter)
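Weighted voting over the columns of a multiple alignment can be sketched as below. The input conventions are assumptions made for this example: rows are strings over A, C, G, T with '-' for a gap and a space where a read does not cover the column, and `weights` optionally gives one number per row and column (for instance a quality score).

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Column-wise weighted vote over aligned reads; gap winners are dropped."""
    width = max(len(row) for row in rows)
    out = []
    for col in range(width):
        votes = defaultdict(float)
        for r, row in enumerate(rows):
            if col < len(row) and row[col] != " ":  # read covers this column
                votes[row[col]] += weights[r][col] if weights else 1.0
        if votes:
            winner = max(votes, key=votes.get)
            if winner != "-":  # a winning gap means no base in the consensus
                out.append(winner)
    return "".join(out)

print(consensus(["TAGATTACA", "TAG-TTACA", "TAGATTACA"]))  # TAGATTACA
# With weights, one high-quality C outvotes one low-quality G:
print(consensus(["ACGT", "ACCT"], [[10, 10, 5, 10], [10, 10, 40, 10]]))  # ACCT
```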
58
Some Assemblers
- PHRAP: early assembler, widely used, good model of read errors. Overlap $O(n^2)$ → layout (no mate pairs) → consensus
- Celera: first assembler to handle large genomes (fly, human, mouse). Overlap → layout → consensus
- Arachne: public assembler (mouse, several fungi)
- Phusion: overlap → clustering → PHRAP → assemblage → consensus
- Euler: indexing → Euler graph → layout by picking paths → consensus
59
EULER - A New Approach to Fragment Assembly
Traditional “overlap-layout-consensus” technique has a high rate of mis-assembly EULER uses the Eulerian Path approach borrowed from “sequencing by hybridization” (SBH) Fragment assembly without repeat masking can be done in linear time with greater accuracy
60
DNA Basepairing
61
DNA Microarrays
62
Sequencing by Hybridization (SBH)
- Build a microarray with all $4^l$ DNA sequences of length l (l ~ 20)
- For a DNA sequence s, measure its l-mer composition
63
l-mer composition
Def: Given a string s of length n, Spectrum(s, l) is the unordered multiset of all (n - l + 1) l-mers in s
The order of individual elements in Spectrum(s, l) does not matter
For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
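Because the spectrum is a multiset, a `Counter` is a natural representation. The sketch below (the function name `spectrum` is illustrative) also shows that the order in which the l-mers are listed does not matter.

```python
from collections import Counter

def spectrum(s: str, l: int) -> Counter:
    """Multiset of all (n - l + 1) l-mers of s, as a Counter."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)
print(sorted(spec))  # ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
# Any ordering of the same l-mers is the same spectrum:
print(spec == Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"]))  # True
```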
64
The SBH Problem Goal: Reconstruct a string from its l-mer composition
Input: A multiset S, representing all l-mers from an (unknown) string s Output: String s such that Spectrum ( s,l ) = S
65
SBH: An Example
S = { ATC, CCA, CAG, TCC, AGT }
[Figure: the l-mers ATC, TCC, CCA, CAG, AGT laid out along the reconstructed sequence ATCCAGT]
66
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
[Figure: graph H with one vertex per l-mer; the path ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC spells ATGCAGGTCC]
Path visited every VERTEX once: Hamiltonian Path
67
SBH: Hamiltonian Path Approach
A more complicated graph: S = { ATG TGG TGC GTG GGC GCA GCG CGT }
68
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1: ATGCGTGGCA Path 2: ATGGCGTGCA
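The two reconstructions can be found by exhaustive search for Hamiltonian paths. This is a backtracking sketch with an illustrative function name; it assumes all l-mers are distinct and is only practical for tiny inputs, since the search is exponential in general.

```python
def hamiltonian_reconstructions(lmers):
    """All strings spelled by paths that visit every l-mer exactly once."""
    lmers = list(lmers)
    results = set()

    def extend(path, used):
        if len(path) == len(lmers):
            # Spell the string: first l-mer plus the last letter of each later one.
            results.add(path[0] + "".join(x[-1] for x in path[1:]))
            return
        for i, x in enumerate(lmers):
            # Edge rule: the (l-1)-suffix of the current l-mer equals the (l-1)-prefix of x.
            if not used[i] and path[-1][1:] == x[:-1]:
                used[i] = True
                extend(path + [x], used)
                used[i] = False

    for i, x in enumerate(lmers):
        used = [False] * len(lmers)
        used[i] = True
        extend([x], used)
    return results

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(hamiltonian_reconstructions(S)))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```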
69
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S
[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]
Path visited every EDGE once
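Building this graph takes one pass over S. In the sketch below (the function name `de_bruijn` is an assumption of this example), each l-mer contributes one edge from its (l-1)-prefix to its (l-1)-suffix, stored as an adjacency list.

```python
from collections import defaultdict

def de_bruijn(lmers):
    """Adjacency lists: vertices are (l-1)-mers, one edge per l-mer."""
    graph = defaultdict(list)
    for lmer in lmers:
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)
        graph[suffix]  # touch the key so vertices with no outgoing edge also appear
    return dict(graph)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
g = de_bruijn(S)
print(sorted(g))  # ['AT', 'CA', 'CG', 'GC', 'GG', 'GT', 'TG']
print(g["TG"])    # ['GG', 'GC']
```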
70
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Two different paths give different sequence reconstructions:
- ATGGCGTGCA
- ATGCGTGGCA
[Figure: the two corresponding paths in the graph on AT, TG, GC, GG, GT, CA, CG]
71
Euler Theorem
A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v)
Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
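The balance condition is cheap to test. The sketch below assumes the graph is given as a dict mapping each vertex to a list of its successors (a representation chosen for this example); it checks in(v) = out(v) for every vertex but does not check connectivity.

```python
from collections import Counter

def is_balanced(graph) -> bool:
    """True if in(v) == out(v) for every vertex of a directed graph."""
    indeg = Counter(v for succs in graph.values() for v in succs)
    vertices = set(graph) | set(indeg)  # include vertices that only appear as targets
    return all(len(graph.get(v, [])) == indeg[v] for v in vertices)

print(is_balanced({"a": ["b"], "b": ["c"], "c": ["a"]}))  # True: a directed cycle
print(is_balanced({"a": ["b"], "b": []}))                  # False: a has out 1, in 0
```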
72
Euler Theorem: Proof Eulerian → balanced
for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v)=out(v) Balanced → Eulerian ???
73
Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.
74
Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
75
Algorithm for Constructing an Eulerian Cycle (cont’d)
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
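Steps (a) to (c) are usually implemented with an explicit stack, which splices the sub-cycles together implicitly instead of merging them one by one. The sketch below is that stack variant (commonly known as Hierholzer's algorithm); it assumes the graph is connected and balanced and is given as a dict from vertex to successor list.

```python
def eulerian_cycle(graph, start):
    """Return an Eulerian cycle as a list of vertices, beginning and ending at start."""
    unused = {u: list(vs) for u, vs in graph.items()}  # copy: edges not yet traversed
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if unused.get(u):
            # Keep walking along unused edges (steps a and b).
            stack.append(unused[u].pop())
        else:
            # Dead end: u is finished, so emit it; backtracking merges cycles (step c).
            cycle.append(stack.pop())
    return cycle[::-1]

g = {0: [1, 3], 1: [2], 2: [0], 3: [0]}
print(eulerian_cycle(g, 0))  # [0, 3, 0, 1, 2, 0]
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.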
76
Euler Theorem: Extension
Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced. (A vertex v is semi-balanced if in(v) and out(v) differ by exactly 1.)
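The extension suggests a direct way to solve small SBH instances: check the degree condition, start at the vertex with one extra outgoing edge, and walk every edge once. The sketch below does this with the same stack technique used for cycles; the function name and the edge-list input format are assumptions of this example, and it returns only one Eulerian path even when several exist.

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """Return one Eulerian path as a vertex list, or None if none exists."""
    if not edges:
        return None
    graph, balance = defaultdict(list), Counter()
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1  # out(u) - in(u)
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    unbalanced = [v for v, b in balance.items() if abs(b) > 1]
    if unbalanced or len(starts) > 1 or len(starts) != len(ends):
        return None  # more than two semi-balanced vertices, or a worse imbalance
    stack, path = [starts[0] if starts else edges[0][0]], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # If some edge was never reached, the graph was not connected.
    return path if len(path) == len(edges) + 1 else None

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
path = eulerian_path([(x[:-1], x[1:]) for x in S])
print(path[0] + "".join(v[-1] for v in path[1:]))  # ATGCGTGGCA
```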
77
Sources Genomes sequenced: http://www.genome.gov/10002154
Serafim Batzoglou (Sequencing slides) (Euler slides)