Download presentation
Presentation is loading. Please wait.
1
CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm
2
CS262 Lecture 9, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig
3
CS262 Lecture 9, Win07, Batzoglou 1. Find Overlapping Reads Find pairs of reads sharing a k-mer, k ~ 24 Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| T GA TAGA | || TACA TAGT || Caveat: repeats A k-mer that occurs N times, causes O(N 2 ) read/read comparisons ALU k-mers could cause up to 1,000,000 2 comparisons Solution: Discard all k-mers that occur “ too often ” Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
4
CS262 Lecture 9, Win07, Batzoglou 1. Find Overlapping Reads Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A replace T with C correlated errors— probably caused by repeats disentangle overlaps TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTATTGA In practice, error correction removes up to 98% of the errors
5
CS262 Lecture 9, Win07, Batzoglou 2. Merge Reads into Contigs Overlap graph: Nodes: reads r 1 …..r n Edges: overlaps (r i, r j, shift, orientation, score) Note: of course, we don’t know the “color” of these nodes Reads that come from two regions of the genome (blue and red) that contain the same repeat
6
CS262 Lecture 9, Win07, Batzoglou 2. Merge Reads into Contigs
7
CS262 Lecture 9, Win07, Batzoglou Overlap graph after forming contigs Unitigs: Gene Myers, 95
8
CS262 Lecture 9, Win07, Batzoglou Repeats, errors, and contig lengths Repeats shorter than read length are easily resolved Read that spans across a repeat disambiguates order of flanking regions Repeats with more base pair diffs than sequencing error rate are OK We throw overlaps between two reads in different copies of the repeat To make the genome appear less repetitive, try to: Increase read length Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate decreases effective repeat content increases contig length
9
CS262 Lecture 9, Win07, Batzoglou 3. Link Contigs into Supercontigs Too dense Overcollapsed Inconsistent links Overcollapsed? Normal density
10
CS262 Lecture 9, Win07, Batzoglou Find all links between unique contigs 3. Link Contigs into Supercontigs Connect contigs incrementally, if 2 links supercontig (aka scaffold)
11
CS262 Lecture 9, Win07, Batzoglou Fill gaps in supercontigs with paths of repeat contigs 3. Link Contigs into Supercontigs
12
CS262 Lecture 9, Win07, Batzoglou 4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)
13
CS262 Lecture 9, Win07, Batzoglou Some Assemblers PHRAP Early assembler, widely used, good model of read errors Overlap O(n 2 ) layout (no mate pairs) consensus Celera First assembler to handle large genomes (fly, human, mouse) Overlap layout consensus Arachne Public assembler (mouse, several fungi) Overlap layout consensus Phusion Overlap clustering PHRAP assemblage consensus Euler Indexing Euler graph layout by picking paths consensus
14
CS262 Lecture 9, Win07, Batzoglou Quality of assemblies Celera’s assemblies of human and mouse
15
CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—mouse
16
CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—mouse N50 contig length Terminology: N50 contig length If we sort contigs from largest to smallest, and start Covering the genome in that order, N50 is the length Of the contig that just covers the 50 th percentile.
17
CS262 Lecture 9, Win07, Batzoglou Quality of assemblies—rat
18
CS262 Lecture 9, Win07, Batzoglou History of WGA 1982: -virus, 48,502 bp 1995: h-influenzae, 1 Mbp 2000: fly, 100 Mbp 2001 – present human (3Gbp), mouse (2.5Gbp), rat *, chicken, dog, chimpanzee, several fungal genomes Gene Myers Let’s sequence the human genome with the shotgun strategy That is impossible, and a bad idea anyway Phil Green 1997
19
CS262 Lecture 9, Win07, Batzoglou Genomes Sequenced http://www.genome.gov/10002154
20
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments
21
CS262 Lecture 9, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication
22
CS262 Lecture 9, Win07, Batzoglou Protein Phylogenies Proteins evolve by both duplication and species divergence
23
CS262 Lecture 9, Win07, Batzoglou Orthology and Paralogy HB Human WB Worm HA1 Human HA2 Human Yeast WA Worm Orthologs: Derived by speciation Paralogs: Everything else Orthologs: Derived by speciation Paralogs: Everything else
24
CS262 Lecture 9, Win07, Batzoglou Orthology, Paralogy, Inparalogs, Outparalogs
25
CS262 Lecture 9, Win07, Batzoglou
26
Definition Given N sequences x 1, x 2,…, x N : Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum A faint similarity between two sequences becomes significant if present in many Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology The patterns of conservation can help us tell function of the element
27
CS262 Lecture 9, Win07, Batzoglou Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
28
CS262 Lecture 9, Win07, Batzoglou Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) = k<l w kl s(m k, m l ) Duck
29
CS262 Lecture 9, Win07, Batzoglou A Profile Representation Given a multiple alignment M = m 1 …m n Replace each column m i with profile entry p i Frequency of each letter in # gaps Optional: # gap openings, extensions, closings Can think of this as a “likelihood” of each letter in each position - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1.8 C.6 1.4 1.6.2 G 1.2.2.4 1 T.2 1.6.2 -.2.8.4.8.4
30
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments Algorithms
31
CS262 Lecture 9, Win07, Batzoglou Multidimensional DP Generalization of Needleman-Wunsh: S(m) = i S(m i ) (sum of column scores) F(i 1,i 2,…,i N ): Optimal alignment up to (i 1, …, i N ) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))
32
CS262 Lecture 9, Win07, Batzoglou Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(x i, x j, x k ), F(i – 1, j – 1, k ) + S(x i, x j, - ), F(i – 1, j, k – 1) + S(x i, -, x k ), F(i – 1, j, k ) + S(x i, -, - ), F(i, j – 1, k – 1) + S( -, x j, x k ), F(i, j – 1, k ) + S( -, x j, - ), F(i, j, k – 1) + S( -, -, x k ) } Multidimensional DP
33
CS262 Lecture 9, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP
34
CS262 Lecture 9, Win07, Batzoglou Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) Multidimensional DP How do gap states generalize? VERY badly! Require 2 N – 1 states, one per combination of gapped/ungapped sequences Running time: O(2 N 2 N L N ) = O(4 N L N ) XYXYZZ YYZ XXZ
35
CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is known: Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles x w y z p xy p zw p xyzw
36
CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is known: Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles x w y z Example Profile: (A, C, G, T, -) p x = (0.8, 0.2, 0, 0, 0) p y = (0.6, 0, 0, 0, 0.4) s(p x, p y ) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result: p xy = (0.7, 0.1, 0, 0, 0.2) s(p x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result: p x- = (0.4, 0.1, 0, 0, 0.5)
37
CS262 Lecture 9, Win07, Batzoglou Progressive Alignment When evolutionary tree is unknown: Perform all pairwise alignments Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment Construct a tree (UPGMA / Neighbor Joining / Other methods) Align on the tree x w y z ?
38
CS262 Lecture 9, Win07, Batzoglou Heuristics to improve alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …
39
CS262 Lecture 9, Win07, Batzoglou Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT
40
CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Algorithm (Barton-Stenberg): 1.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 2.Repeat 4 until convergence x y z x,z fixed projection allow y to vary
41
CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA
42
CS262 Lecture 9, Win07, Batzoglou Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing
43
CS262 Lecture 9, Win07, Batzoglou Consistency z x y xixi yjyj y j’ zkzk
44
CS262 Lecture 9, Win07, Batzoglou Consistency Basic method for applying consistency Compute all pairs of alignments xy, xz, yz, … When aligning x, y during progressive alignment, For each (x i, y j ), let s(x i, y j ) = function_of(x i, y j, a xz, a yz ) Align x and y with DP using the modified s(.,.) function z x y xixi yjyj y j’ zkzk
45
CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE High throughput One of the best in accuracy ProbCons High accuracy Reasonable speed
46
CS262 Lecture 9, Win07, Batzoglou MUSCLE at a glance 1.Fast measurement of all pairwise distances between sequences D DRAFT (x, y) defined in terms of # common k-mers (k~3) – O(N 2 L logL) time 2.Build tree T DRAFT based on those distances, with UPGMA 3.Progressive alignment over T DRAFT, resulting in multiple alignment M DRAFT Only perform alignment steps for the parts of the tree that have changed 4.Measure new Kimura-based distances D(x, y) based on M DRAFT 5.Build tree T based on D 6.Progressive alignment over T, to build M 7.Iterative refinement; for many rounds, do: Tree Partitioning: Split M on one branch and realign the two resulting profiles If new alignment M’ has better sum-of-pairs score than previous one, accept
47
CS262 Lecture 9, Win07, Batzoglou PROBCONS at a glance 1.Computation of all posterior matrices M xy : M xy (i, j) = Prob(x i ~ y j ), using a HMM 2.Re-estimation of posterior matrices M’ xy with probabilistic consistency M’ xy (i, j) = 1/N sequence z k M xz (i, k) M yz (j, k);M’ xy = Avg z (M xz M zy ) 3.Compute for every pair x, y, the maximum expected accuracy alignment A xy : alignment that maximizes aligned (i, j) in A M’ xy (i, j) Define E(x, y) = aligned (i, j) in Axy M’ xy (i, j) 4.Build tree T with hierarchical clustering using similarity measure E(x, y) 5.Progressive alignment on T to maximize E(.,.) 6.Iterative refinement; for many rounds, do: Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles
48
CS262 Lecture 9, Win07, Batzoglou Some Resources Genome Resources Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2 ABC—Nice Stanford tool for browsing alignments http://encode.stanford.edu/~asimenos/ABC/ Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.