1
Alignments and Comparative Genomics
2
Welcome to CS374!
Today:
  Serafim: Alignments and Comparative Genomics
  Omkar: Administrivia
3
Biology in One Slide – Twentieth Century …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… …and today
4
Complete DNA Sequences
Nearly 200 complete genomes have been sequenced.
5
Evolution
6
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
7
Evolutionary Rates
[Figure labels: OK, X, X, Still OK?, next generation]
8
Sequence conservation implies function
Alignment is the key to:
  Finding important regions
  Determining function
  Uncovering the evolutionary forces
9
Sequence Alignment
Unaligned:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
10
What is a good alignment?
Alignment: the “best” way to match the letters of one sequence with those of the other.
How do we define “best”?
Alignment: a hypothesis that the two sequences come from a common ancestor through sequence edits.
Parsimonious explanation: find the minimum number of edits that transform one sequence into the other.
11
Scoring Function
Sequence edits, starting from AGGCCTC:
  Mutations:  AGGACTC
  Insertions: AGGGCCTC
  Deletions:  AGG.CTC
Scoring function:
  Match: +m
  Mismatch: -s
  Gap: -d
Score $F = (\#\text{matches}) \cdot m - (\#\text{mismatches}) \cdot s - (\#\text{gaps}) \cdot d$
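The scoring function can be sketched in a few lines. This is a minimal illustration, assuming the two rows are already aligned (equal length, gaps written as "-") and using m = s = d = 1 as default values chosen here for illustration; the function name `alignment_score` is hypothetical.

```python
def alignment_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Score two aligned rows: +m per match, -s per mismatch, -d per gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        if p == "-" or q == "-":
            score -= d          # one letter faces a gap
        elif p == q:
            score += m          # match
        else:
            score -= s          # mismatch
    return score


# The mutation example from the slide: six matches and one mismatch.
print(alignment_score("AGGCCTC", "AGGACTC"))  # 5
```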
12
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^{M+N})$
13
Dynamic Programming
Given two sequences $x = x_1 \ldots x_M$ and $y = y_1 \ldots y_N$.
Let $F(i, j)$ = score of the best alignment of $x_1 \ldots x_i$ to $y_1 \ldots y_j$.
Then $F(M, N)$ = score of the best alignment.
Idea: compute $F(i, j)$ for all $i$ and $j$, using $F(i-1, j)$, $F(i, j-1)$, and $F(i-1, j-1)$.
14
Dynamic Programming (cont’d)
Notice three possible cases:
1. $x_i$ aligns to $y_j$
     $x_1 \ldots x_{i-1} \; x_i$
     $y_1 \ldots y_{j-1} \; y_j$
   $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap
     $x_1 \ldots x_{i-1} \; x_i$
     $y_1 \ldots y_j \; -$
   $F(i, j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap
     $x_1 \ldots x_i \; -$
     $y_1 \ldots y_{j-1} \; y_j$
   $F(i, j) = F(i, j-1) - d$
15
Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal.
Then
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not.
[Figure: cell (i, j) with its neighbors (i-1, j-1), (i-1, j), and (i, j-1)]
16
Example
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (a match scores +1; a mismatch or a gap scores -1)

F(i, j)    i = 0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1   A     -1    1    0   -1   -2
j = 2   T     -2    0    0    1    0
j = 3   A     -3   -1   -1    0    2

Optimal alignment: F(4, 3) = 2
AGTA
A-TA
17
The Needleman-Wunsch Matrix
[Figure: matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]
Every nondecreasing path from (0, 0) to (M, N) corresponds to an alignment of the two sequences.
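To get a feel for how quickly the number of such paths grows, the sketch below counts the paths from (0, 0) to (M, N) that use single steps right, down, or diagonally. The function name `count_paths` is hypothetical, and the count treats two paths that differ only in the order of adjacent gaps as distinct.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_paths(m: int, n: int) -> int:
    """Number of monotone lattice paths from (0, 0) to (m, n)."""
    if m == 0 or n == 0:
        return 1  # only one way: a straight run of gaps
    # The last step is a gap in y, a gap in x, or a letter-letter column.
    return (count_paths(m - 1, n)
            + count_paths(m, n - 1)
            + count_paths(m - 1, n - 1))


print(count_paths(4, 3))    # 129 paths for sequences of length 4 and 3
print(count_paths(40, 40))  # already astronomically large
```

Even for the tiny 4-by-3 case there are 129 paths, which is why enumerating alignments is hopeless and a table-filling method is used instead.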
18
The Needleman-Wunsch Algorithm
1. Initialization.
   a. $F(0, 0) = 0$
   b. $F(0, j) = -j \cdot d$
   c. $F(i, 0) = -i \cdot d$
2. Main Iteration. Filling in partial alignments.
   a. For each $i = 1, \ldots, M$
        For each $j = 1, \ldots, N$
          $$F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + s(x_i, y_j) & \text{[case 3]} \end{cases}$$
          $$Ptr(i, j) = \begin{cases} \text{UP} & \text{if [case 1]} \\ \text{LEFT} & \text{if [case 2]} \\ \text{DIAG} & \text{if [case 3]} \end{cases}$$
3. Termination. $F(M, N)$ is the optimal score, and from $Ptr(M, N)$ we can trace back the optimal alignment.
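The three steps can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it uses the UP/LEFT/DIAG pointer names from the slide, assumes m = s = d = 1 by default, and breaks ties in favor of DIAG, then UP, which is a choice made here and not something the slides specify.

```python
def needleman_wunsch(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Return (optimal score, aligned x, aligned y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # 2. Main iteration: fill in the best score for every pair of prefixes.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d      # x_i aligned to a gap
            left = F[i][j - 1] - d    # y_j aligned to a gap
            best = max(diag, up, left)
            F[i][j] = best
            ptr[i][j] = "DIAG" if best == diag else ("UP" if best == up else "LEFT")

    # 3. Termination: trace pointers back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Both the table and the pointer matrix have (M+1)(N+1) entries, which matches the O(NM) time and space stated for the algorithm.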
19
Performance
Time: $O(NM)$
Space: $O(NM)$
20
Alignment on a Large Scale
Given a newly sequenced organism, which subregions align with other organisms?
  Potential genes
  Other biological characteristics
Assume we use Dynamic Programming:
  Our newly sequenced mammal: $3 \times 10^9$ letters
  The entire genomic database: $10^{10}$ to $10^{11}$ letters
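As a rough worked estimate of why full dynamic programming is impractical at this scale, take a mammalian genome of about $3 \times 10^9$ letters, the lower end of the database size ($10^{10}$ letters), and assume, purely for illustration, a machine that fills $10^9$ matrix cells per second:

```latex
\begin{align*}
M \cdot N &\approx (3 \times 10^{9}) \times 10^{10} = 3 \times 10^{19} \text{ cells} \\
\text{time} &\approx \frac{3 \times 10^{19} \text{ cells}}{10^{9} \text{ cells/s}}
             = 3 \times 10^{10} \text{ s} \approx 950 \text{ years}
\end{align*}
```

The throughput figure is an assumption, but even a machine a thousand times faster would need most of a year for a single query.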
21
Index-based Local Alignment
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time:
  Theoretical worst case: $O(MN)$
  Fast in practice
[Figure labels: query, DB]
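The dictionary step can be sketched with an ordinary hash map. This is a minimal illustration, assuming exact word matches only; the function names and the two short sequences are made up for the example.

```python
from collections import defaultdict


def build_word_index(query: str, k: int) -> dict:
    """Map every length-k word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query: str, db: str, k: int) -> list:
    """Scan the database once; report (query_pos, db_pos) for each word match."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds


# Illustrative sequences; each seed is where a local alignment would start.
print(find_seeds("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTTCCC", 4))
# [(8, 6), (9, 7), (10, 8)]
```

The scan touches each database position once, which is why the method is fast in practice even though the number of seeds can, in the worst case, be proportional to MN.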
22
Index-based Local Alignment: BLAST
Dictionary: all words of length k (~11)
Alignment initiated between exact-matching words (more generally, between words with alignment score at least T)
Alignment: ungapped extensions until the score falls below a statistical threshold
Output: all local alignments with score > statistical threshold
[Figure labels: query, DB, scan]
23
Index-based Local Alignment: BLAST
Example: k = 4, T = 4
[Figure: dot matrix with axis sequences A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A]
The matching word GGTC initiates an alignment.
Extension to the left and right with no gaps until alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
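The extension step can be sketched as below. Two assumptions are made here and are not taken from the slide: the stopping rule is an X-drop rule (stop once the running score falls more than `x_drop` below the best seen) standing in for the slide's 50% rule, and the second sequence is read in the reverse of the order in which it appears in the extracted figure text, which is the reading under which it contains the output shown. The function name and scoring values are hypothetical.

```python
def extend_ungapped(query, db, qi, dj, k, match=1, mismatch=-1, x_drop=2):
    """Extend a length-k seed at query[qi:], db[dj:] in both directions, no gaps."""

    def extend(step):
        # Start just outside the seed and walk away from it.
        q, d = (qi + k, dj + k) if step > 0 else (qi - 1, dj - 1)
        run = best = best_len = length = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            run += match if query[q] == db[d] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length   # new best extension
            elif run < best - x_drop:
                break                          # fallen too far: give up
            q += step
            d += step
        return best_len

    left, right = extend(-1), extend(+1)
    return (query[qi - left: qi + k + right],
            db[dj - left: dj + k + right])


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"   # second axis sequence, read in reverse (assumption)
print(extend_ungapped(query, db, query.find("GGTC"), db.find("GGTC"), 4))
# ('GTAAGGTCC', 'GTTAGGTCC')
```

With these settings the extension absorbs the single A/T mismatch on the left because the matches beyond it raise the score again, then stops once mismatches accumulate.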
24
Gapped BLAST
[Figure: dot matrix with axis sequences A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
Added features:
  Pairs of words can initiate alignment
  Nearby alignments are merged
  Extensions with gaps until the score falls more than T below the best score so far
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
25
Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters

>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length = 144487

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124

>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length = 139823

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus

Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

http://www.ncbi.nlm.nih.gov
26
Efficient global alignment
27
Global alignment with the chaining approach
1. Find local alignments
2. Chain them into a rough global map
3. Align regions in-between
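Step 2 can be sketched as a highest-scoring chain computation. This is a simple quadratic-time illustration, not the sparse dynamic programming that real tools use; the tuple layout `(x_start, x_end, y_start, y_end, score)` with half-open intervals, the function name, and the example anchors are all assumptions made for the sketch.

```python
def chain_anchors(anchors):
    """Pick the highest-scoring set of local alignments that is increasing
    in both sequences. Each anchor is (x_start, x_end, y_start, y_end, score)."""
    if not anchors:
        return []
    anchors = sorted(anchors)              # by x_start
    n = len(anchors)
    best = [a[4] for a in anchors]         # best chain score ending at anchor i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            # Anchor j may precede i only if it ends before i starts in x and y.
            if anchors[j][1] <= anchors[i][0] and anchors[j][3] <= anchors[i][2]:
                candidate = best[j] + anchors[i][4]
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    # Walk back from the best chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


hits = [(0, 10, 0, 10, 10), (12, 20, 50, 58, 8),
        (15, 25, 14, 24, 10), (30, 40, 30, 40, 10)]
for anchor in chain_anchors(hits):
    print(anchor)   # the hit at y = 50..58 is off the main diagonal and is dropped
```

The gaps between consecutive anchors in the returned chain are the "regions in-between" that step 3 aligns with dynamic programming restricted to those boxes.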
28
LAGAN: 1. FIND Local Alignments
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
Mike Brudno, Chuong B Do, et al.
29
LAGAN: 2. CHAIN Local Alignments
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
Mike Brudno, Chuong B Do, et al.
30
LAGAN: 3. Restricted DP
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
Mike Brudno, Chuong B Do, et al.
31
Restricted DP (cont’d)
What if a box is too large?
Recursive application of LAGAN, with a more sensitive word search
32
Multiple Alignment
34
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
35
Sum Of Pairs (cont’d)
The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments:
$$S(m) = \sum_{k<l} s(m^k, m^l)$$
where $s(m^k, m^l)$ is the score of the induced alignment $(k, l)$.
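The definition can be sketched directly: project each pair of rows, drop the columns where both rows have a gap, score the result, and add up over all pairs. The pairwise scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions made for this illustration.

```python
from itertools import combinations


def induced_pair(a: str, b: str):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)


def pair_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Score an induced pairwise alignment column by column."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score -= d
        elif p == q:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, m: int = 1, s: int = 1, d: int = 1) -> int:
    """S(m): sum of the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b), m, s, d)
               for a, b in combinations(rows, 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z
print(induced_pair(rows[0], rows[1]))  # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(rows))              # 2 + (-1) + 4 = 5
```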
36
Dynamic Programming for Multiple Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
[Figure axes: x, y, z]
37
Progressive Alignment
Multiple Alignment is NP-complete
Most used heuristic: Progressive Alignment
Algorithm:
Until all sequences are aligned:
  – Align two (multi-)sequences to each other, and treat the result as a new sequence
Example: aligning AACTGTA with AATGTC gives
AACTGTA
AA-TGTC
with “letters” (AA), (AA), (C-), (TT), (GG), (TT), (AC)
Running time: $O(N L^2)$, where N is the number of sequences and L is the length of a sequence
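A minimal sketch of the heuristic follows. It treats each alignment column as one "letter" (a tuple of symbols) and folds sequences in one at a time with a Needleman-Wunsch pass, scoring a column against a new symbol by summing pairwise scores. Several things here are assumptions made for illustration: sequences are merged in input order rather than along a guide tree, the scores are match +1, mismatch -1, gap -1, the first sequence is non-empty, and all function names are hypothetical.

```python
def pair(p, q, m=1, s=1, d=1):
    """Pairwise score of two symbols; a gap facing a gap costs nothing."""
    if p == "-" and q == "-":
        return 0
    if p == "-" or q == "-":
        return -d
    return m if p == q else -s


def col_vs(col, symbol):
    """Score of placing `symbol` against every row of an alignment column."""
    return sum(pair(c, symbol) for c in col)


def add_sequence(columns, seq):
    """Align `seq` to existing columns; return columns with one more row."""
    gap_col = ("-",) * len(columns[0])
    M, N = len(columns), len(seq)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = F[i - 1][0] + col_vs(columns[i - 1], "-")
    for j in range(1, N + 1):
        F[0][j] = F[0][j - 1] + col_vs(gap_col, seq[j - 1])
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_vs(columns[i - 1], seq[j - 1]),
                          F[i - 1][j] + col_vs(columns[i - 1], "-"),
                          F[i][j - 1] + col_vs(gap_col, seq[j - 1]))
    # Trace back by re-checking which case produced each cell.
    out, i, j = [], M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_vs(columns[i - 1], seq[j - 1]):
            out.append(columns[i - 1] + (seq[j - 1],)); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_vs(columns[i - 1], "-"):
            out.append(columns[i - 1] + ("-",)); i -= 1
        else:
            out.append(gap_col + (seq[j - 1],)); j -= 1
    return out[::-1]


def progressive_align(seqs):
    """Merge sequences one by one; return the rows of the multiple alignment."""
    columns = [(c,) for c in seqs[0]]
    for seq in seqs[1:]:
        columns = add_sequence(columns, seq)
    return ["".join(row) for row in zip(*columns)]


print(progressive_align(["AACTGTA", "AATGTC"]))  # ['AACTGTA', 'AA-TGTC']
```

Each merge is one quadratic-time alignment, and there are N - 1 merges, which is where the $O(N L^2)$ running time comes from when column scoring is treated as constant time.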
38
MLAGAN: Progressive Alignment
Given N sequences and a phylogenetic tree:
Align pairwise, in order of the tree (LAGAN), with needed generalizations for multi-anchoring & scoring edit distance
[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
39
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
40
Local & Global Alignment
Local:
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Global:
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
41
Glocal Alignment Problem
Find the least-cost transformation of one sequence into another using shuffle operations:
  Sequence edits
  Inversions
  Translocations
  Duplications
  Combinations of the above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
42
SLAGAN: 1. Find Local Alignments
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
43
SLAGAN: 2. Build Homology Map
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
44
SLAGAN: 2. Build Homology Map
Chain using Sparse Dynamic Programming
Penalties:
  a) regular
  b) translocation
  c) inversion
  d) inverted translocation
[Figure labels: a, b, c, d]
45
SLAGAN: 2. Build Homology Map
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
46
SLAGAN: 3. Global Alignment
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
47
SLAGAN Example: Chromosome 20
Human Chromosome 20 versus Mouse Chromosome 2
270 segments of conserved synteny
70 inversions
48
SLAGAN example: HOX cluster
10 paralogous genes
Conserved order in Human/Mouse/Rat
50
Whole-genome alignment with SLAGAN
Two-step Shuffle:
1. Shuffle for large-scale synteny map
2. Shuffle each syntenic region for microrearrangements
51
The ENCODE Project
52
ENCODE regions shuffled
[Figure panels: Hum/Mus, Hum/Rat]
57
Constrained Elements in Alignments
58
Human-Mouse-Rat Berkeley Genome Pipeline http://pipeline.lbl.gov
59
Human-Mouse-Rat
60
More DNA is coming…