Alignment of Two Sequences

Presentation transcript:

1 Alignment of Two Sequences

2 [Figure: dynamic programming matrix for aligning AGC (columns) with AAAC (rows)]

3 Sequence comparison: Motivation
Finding similarity between sequences is important for many biological questions.
- Find homologous proteins
  - Allows prediction of structure and function
- Locate similar subsequences in DNA
  - E.g., allows identification of regulatory elements
- Locate DNA sequences that might overlap
  - Helps in sequence assembly

4 Dot plots
- Not technically an “alignment”
- But gives a picture of the correspondence between pairs of sequences
- A dot represents similarity between segments of the two sequences

5 Sequence Alignment
- Input: two sequences over the same alphabet
- Output: an alignment of the two sequences
- Two basic variants of sequence alignment:
  - Global: all characters in both sequences participate (Needleman-Wunsch, 1970)
  - Local: find related regions within sequences (Smith-Waterman, 1981)

6 Sequence Alignment - Example
Input: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
Possible output:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
- Three elements:
  - Perfect matches
  - Mismatches
  - Insertions & deletions (indels)

7 Scoring Function
- Score each position independently:
  - Match: +1
  - Mismatch: -1
  - Indel: -2
- The score of an alignment is the sum of its position scores
- Example:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
  Score: (+1x13) + (-1x2) + (-2x4) = 3
- Example:
    GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
  Score: (+1x5) + (-1x6) + (-2x11) = -23
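The position-by-position scoring described on this slide can be sketched in a few lines of Python. The function name `score_alignment` and its parameter names are illustrative assumptions, not part of the original slides; the default scores are the slide's (+1, -1, -2).

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum per-column scores of a gapped alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += indel      # insertion or deletion
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total


# First example alignment from the slide: 13 matches, 2 mismatches, 4 indels.
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
```

Because every column is scored independently, the total is simply a sum, which is what later makes dynamic programming applicable.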

8 Homology Example: Evolution of the Globins

9 Sequence vs. Structure Similarity
Sequence 1: lcl|1A6M:_ MYOGLOBIN, Length 151 (1..151)
Sequence 2: lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III, Length 147 (1..147)
Score = 31.6 bits (70), Expect = 10
Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%)

Query: 2   LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59
           LS + Q+V W + + AG G++ L + +HPE F +
Sbjct: 2   LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG FSGASDP 53

Query: 60  DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH
           G VL +G L +G AE+K + H KH I +Y E
Sbjct: 54  GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112

Query: 114 VLHSRHPGDFGADAQGA
           R G A A+ A
Sbjct: 113 AMEHRIGGKMNAAAKDA 129

10 Example Alignment: Globins
- The figure at right shows the prototypical structure of globins
- The figure below shows part of an alignment for 8 globins (-’s indicate gaps)

11 Insertions/Deletions and Protein Structure
- Why is it that two “similar” sequences may have large insertions/deletions?
  - Some insertions and deletions may not significantly affect the structure of a protein
- Loop structures: insertions/deletions here are not so significant

12 Sequence vs. Structure Similarity
1A6M: Myoglobin; 1JL7: Hemoglobin
- Myoglobin and hemoglobin are similar, but slight differences in structure let them perform different functions.

13 Myoglobin & Hemoglobin
- /structure/HbMb/hbmb.htm
  Nice videos on the structure of myoglobin and hemoglobin

14 The Space of Global Alignments
Some possible global alignments for ELV and VIS:
    ELV     -ELV    --ELV   ELV-    ELV--   E-LV    EL-V
    VIS     VIS-    VIS--   -VIS    --VIS   VIS-    -VIS

15 Number of Possible Alignments
- Given sequences of length m and n
- Assume we don’t count as distinct
    C-         -C
    -G   and   G-
- We can have as few as 0 and as many as min{m, n} aligned pairs
- Therefore the number of possible alignments is given by
  $\sum_{k=0}^{\min\{m,n\}} \binom{m}{k}\binom{n}{k} = \binom{m+n}{n}$

16 Number of Possible Alignments
- There are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments for 2 sequences of length n
- E.g., two sequences of length 100 have about $9 \times 10^{58}$ possible alignments
- But we can use dynamic programming to find an optimal alignment efficiently
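As a rough check on how quickly this search space grows, the count can be computed directly. This is a sketch under the assumption stated on the previous slide (an alignment is identified by its set of aligned pairs, so choosing k aligned positions in each sequence fixes it); `count_alignments` is a hypothetical helper name.

```python
from math import comb


def count_alignments(m, n):
    """Number of global alignments of sequences of length m and n,
    counting alignments with the same aligned pairs only once."""
    # Choose which k residues of each sequence take part in aligned pairs.
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


# The sum collapses to a single binomial coefficient (Vandermonde identity).
assert count_alignments(100, 100) == comb(200, 100)
print(count_alignments(3, 3))                # e.g. ELV vs VIS
print(len(str(count_alignments(100, 100))))  # number of decimal digits
```

Enumerating that many candidates is out of the question, which motivates the dynamic programming approach that follows.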

17 Dynamic Programming
- Algorithmic technique for optimization problems that have two properties:
  - Optimal substructure: an optimal solution can be computed from optimal solutions to subproblems
  - Overlapping subproblems: subproblems overlap such that the total number of distinct subproblems to be solved is relatively small

18 Dynamic Programming
- Break the problem into overlapping subproblems
- Use memoization: remember solutions to subproblems that we have already seen

19 Fibonacci example
- 1, 1, 2, 3, 5, 8, 13, 21, ...
- fib(n) = fib(n - 2) + fib(n - 1)
- Could implement as a simple recursive function
- However, the complexity of the simple recursive function is exponential in n

20 Fibonacci dynamic programming
- Two approaches:
  1. Memoization: store results from previous calls of the function in a table (top-down approach)
  2. Solve subproblems from smallest to largest, storing results in a table (bottom-up approach)
- Both require evaluating all (n-1) subproblems only once: O(n)
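The two approaches can be sketched side by side. The function names are illustrative assumptions; indexing follows the sequence on the previous slide, so fib(1) = fib(2) = 1.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down: plain recursion, with results cached per argument."""
    if n <= 2:
        return 1
    return fib_memo(n - 2) + fib_memo(n - 1)


def fib_bottom_up(n):
    """Bottom-up: fill a table from the smallest subproblem upward."""
    table = {1: 1, 2: 1}
    for k in range(3, n + 1):
        table[k] = table[k - 2] + table[k - 1]
    return table[n]


print([fib_memo(k) for k in range(1, 9)])
print([fib_bottom_up(k) for k in range(1, 9)])
```

Both versions solve each subproblem once. The bottom-up form is the one that carries over to alignment, where the table becomes a matrix.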

21 Dynamic Programming Graphs
- Dynamic programming algorithms can be represented by a directed acyclic graph
  - Each subproblem is a vertex
  - Direct dependencies between subproblems are edges
[Figure: graph for fib(6)]

22 Global Alignment
- Input: two sequences over the same alphabet
- Output: an alignment of the two sequences in which all characters in both sequences participate
- The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences
  - Uses a scoring function
  - A dynamic programming algorithm

23 Dynamic Programming Idea
- Consider the last step in computing the alignment of AAAC with AGC
- Three possible options; in each we’ll choose a different pairing for the end of the alignment, and add this to the best alignment of the previous characters:
    AAA  C        AAAC  -        AAA  C
    AG   C        AG    C        AGC  -
  (best alignment of these prefixes + score of aligning this pair)

24 The Needleman-Wunsch (NW) Algorithm
- Suppose we have two sequences: $s = s_1 \ldots s_n$ and $t = t_1 \ldots t_m$
- Construct a matrix V[n+1, m+1] in which V(i, j) contains the score of the best alignment between $s_1 \ldots s_i$ and $t_1 \ldots t_j$.
  - The value of cell V(i, j) is:
    $$V(i,j) = \max \begin{cases} V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
- d is the gap-open penalty (linear gap penalty)
- V(n, m) is the score of the best alignment between s and t

25 NW Algorithm – An Example
- Alphabet: DNA, ∑ = {A, C, G, T}
- Input: s = AAAC, t = AGC
- Scoring scheme:
  - Match: score(x, x) = 1
  - Mismatch: score(x, y) = -1
  - Gap opening: d = -2
- Recurrence:
  $$V(i,j) = \max \begin{cases} V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$

26 Initializing Matrix: Global Alignment with Linear Gap Penalty
              A     G     C
        0     s     2s    3s
    A   s
    A   2s
    A   3s
    C   4s

27 NW Algorithm – An Example
[Figure: filled matrix V for t = AGC (columns) and s = AAAC (rows)]
Optimal alignments:
    AG-C    -AGC    A-GC
    AAAC    AAAC    AAAC
Match: score(x, x) = 1; Mismatch: score(x, y) = -1; Gap opening: d = -2
$$V(i,j) = \max \begin{cases} V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
- The best score, by definition
- Traceback: move back through the cells from which each value was computed
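A minimal Python sketch of the fill and traceback steps for this example follows. The function name `needleman_wunsch` and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions made for illustration. The slides list three equally good alignments and this sketch returns only one of them.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, d=-2):
    """Global alignment with a linear gap penalty d (negative)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: i leading gaps
        V[i][0] = i * d
    for j in range(1, m + 1):          # first row: j leading gaps
        V[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j] + d,
                          V[i][j - 1] + d,
                          V[i - 1][j - 1] + sub)
    # Traceback from V[n][m] along cells that produced each value.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + d:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, a, b = needleman_wunsch("AAAC", "AGC")
print(score)
print(a)
print(b)
```

Filling the table touches each of the (n+1)(m+1) cells once, and the traceback takes at most n+m steps.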

28 NW – Time and Space Complexity
Time:
- Filling the matrix: O(n·m)
- Backtracing: O(n+m)
- Overall: O(n·m)
Space:
- Holding the matrix: O(n·m)

29 Local Alignment Motivation
- Useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere
- Useful for comparing DNA sequences that share a similar motif but differ elsewhere
- Useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence)
- More sensitive when comparing highly diverged sequences

30 Structure of a genome
- Human: 3x10^9 bp
- Genome: ~30,000 genes, ~200,000 exons, ~23 Mb coding, ~15 Mb noncoding
[Figure: a gene; transcription gives pre-mRNA, splicing gives mature mRNA, translation gives protein]

31 Structure of a genome
[Figure: genes B and D with regulatory logic such as “If B then NOT D”, “If A and B then D”, “If D then B”, “Make B”, “Make D”]
- Short sequences regulate expression of genes
- Lots of “junk” sequence, e.g. ~50% repeats, selfish DNA

32 Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- ~75% average similarity in protein sequence
[Figure: alignment of the “atoh” enhancer in human, mouse, rat, fugu fish (rows hum_a, mus_a, rat_a, fug_a)]

33 The local alignment problem
Given two strings $x = x_1 \ldots x_M$, $y = y_1 \ldots y_N$
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
E.g. x = aaaacccccgggg, y = cccgggaaccaacc

34 Smith-Waterman Algorithm
- Two differences from Needleman-Wunsch:
  - Each matrix element is allowed to take the value zero if all the other options are negative:
    $$V(i,j) = \max \begin{cases} 0 \\ V(i-1,j) + d \\ V(i,j-1) + d \\ V(i-1,j-1) + \mathrm{score}(s_i, t_j) \end{cases}$$
  - The alignment can end anywhere in the table. The best score is the largest value in the whole matrix, and the traceback starts from there

35 Local Alignment
- Let gap = -2, match = 1, mismatch = -1.
[Figure: local alignment matrix for GATCACCT and GATACCC]
    GATCACCT
    GAT_ACCC
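A minimal sketch of the two Smith-Waterman modifications, using the scoring values from this slide and the example strings from the local alignment problem slide. The function name `smith_waterman` is an assumption. Ties for the maximum are resolved by taking the first cell found, and the traceback prefers the diagonal.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=-2):
    """Local alignment: returns (best score, aligned part of x, of y)."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(0,                      # restart the alignment here
                          V[i - 1][j] + d,
                          V[i][j - 1] + d,
                          V[i - 1][j - 1] + sub)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)
    # Traceback starts at the largest cell and stops at the first zero.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("aaaacccccgggg", "cccgggaaccaacc"))
```

For these two strings the shared block `cccggg` is recovered and the unrelated flanks are ignored, which is the behaviour the local alignment problem asks for.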

36 Overlap Alignment
Consider the following problem: find the most significant overlap between two sequences S, T.
Possible overlap relations: a. b. [figure]
Difference from local alignment: here we require alignment between the endpoints of the two sequences.
We want a variant of global alignment in which overhanging ends are not penalized. That is, the alignment starts on the left or top border of the matrix and ends on the right or bottom border.

37 Overlap Alignment
Formally: given S[1..n], T[1..m], find i, j such that d is maximal, where
  d = max{ D(S[1..i], T[j..m]), D(S[i..n], T[1..j]), D(S[1..n], T[i..j]), D(S[i..j], T[1..m]) }.
Solution: the same as global alignment, except that we do not penalize overhanging ends.

38 Overlap Alignment
- Initialization: V[i,0] = 0, V[0,j] = 0
- Recurrence: as in global alignment
- Score: maximum value in the bottom row and rightmost column
[Figure: global, local, overlap]

39 Overlap Alignment (Example)
S = PAWHEAE, T = HEAGAWGHEE
Scoring scheme: Match: +4; Mismatch: -1; Indel: -5

40 Overlap Alignment (Example)
S = PAWHEAE, T = HEAGAWGHEE
Scoring scheme: Match: +4; Mismatch: -1; Indel: -5

41 Overlap Alignment (Example)
S = PAWHEAE, T = HEAGAWGHEE
Scoring scheme: Match: +4; Mismatch: -1; Indel: -5

42 Overlap Alignment (Example)
Scoring scheme: Match: +4; Mismatch: -1; Indel: -5
The best overlap is:
    PAWHEAE------
    ---HEAGAWGHEE
Pay attention! A different scoring scheme (e.g. Indel: -2) could yield a different result, such as:
    ---PAW-HEAE
    HEAGAWGHEE-
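The overlap variant changes only the initialization and where the best score is read. The sketch below assumes S runs along the rows and T along the columns. The function name `overlap_alignment` is hypothetical, and only the aligned (non-overhanging) region is returned.

```python
def overlap_alignment(s, t, match=4, mismatch=-1, indel=-5):
    """Overlap alignment: overhanging ends are free.
    Returns (score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    # Zero first row and column: starting anywhere on a border is free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j] + indel,
                          V[i][j - 1] + indel,
                          V[i - 1][j - 1] + sub)
    # Best score is the maximum over the bottom row and rightmost column.
    candidates = [(V[n][j], n, j) for j in range(m + 1)]
    candidates += [(V[i][m], i, m) for i in range(n + 1)]
    best, i, j = max(candidates)
    # Traceback until the top or left border is reached.
    row_s, row_t = [], []
    while i > 0 and j > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(overlap_alignment("PAWHEAE", "HEAGAWGHEE"))
print(overlap_alignment("PAWHEAE", "HEAGAWGHEE", indel=-2))
```

Calling it a second time with a milder indel penalty illustrates the slide's warning that the preferred overlap depends on the scoring scheme.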

43 Dynamic Programming with More Complex Models
- So far we have considered the simplest gap model, in which the penalty depends linearly on the gap length. Each additional residue is penalized the same as the first.
  - $\gamma(n) = -nd$, where n is the number of residues and d is the gap-open penalty
- Introduce an affine function.
  - $\gamma(n) = -d - (n-1)e$, where n is the number of residues, d is the gap-open penalty, and e is the gap-extension penalty

44 Dynamic Programming for the Affine Gap Penalty Case
- To do in O(mn) time, need 3 matrices instead of 1:
  - M: best score given that x[i] is aligned to y[j]
  - Ix: best score given that x[i] is aligned to a gap
  - Iy: best score given that y[j] is aligned to a gap
    I G A x_i      A I G A x_i      G A x_i - -
    L G V y_j      G V y_j - -      S L G V y_j

45 Why Three Matrices Are Needed
- Consider aligning the sequences WFP and FW using d = -4 (gap opening), e = -1 (gap extension) and the following values from the BLOSUM-62 substitution matrix:
  S(F, W) = 1, S(W, W) = 11, S(F, F) = 6, S(W, P) = -4, S(F, P) = -4
- The matrix shows the highest-scoring partial alignment for each pair of prefixes
- Optimal alignment:
    -WFP
    FW--
- Best alignment of these prefixes:
    WF
    FW
  To get the optimal alignment, we need to also remember:
    -WF
    FW-

46 Global Alignment DP for the Affine Gap Penalty Case
[Figure: recurrences for M, Ix and Iy; moving from M into Ix or Iy adds d+e, staying in Ix or Iy adds e]

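47 Global Alignment DP for the Affine Gap Penalty Case
- Initialization [figure: the first gap cell costs d+e]
- Traceback
  - start at the largest of M(m, n), Ix(m, n), Iy(m, n)
  - stop at any of M(0, 0), Ix(0, 0), Iy(0, 0)
  - note that pointers may traverse all three matrices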
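The three-matrix scheme can be sketched as a score-only computation. This assumes the convention suggested by the d+e and e labels on these slides: d and e are negative, opening a gap adds d + e, and each further gap position adds e. The names `affine_global_score`, `blosum_subset` and `BLOSUM_SUBSET` are hypothetical, and the lookup table holds only the five BLOSUM-62 values quoted for the WFP/FW example.

```python
NEG_INF = float("-inf")

# Only the substitution scores quoted on the WFP/FW slide.
BLOSUM_SUBSET = {("F", "W"): 1, ("W", "W"): 11, ("F", "F"): 6,
                 ("W", "P"): -4, ("F", "P"): -4}


def blosum_subset(a, b):
    """Symmetric lookup; returns None for pairs not in the subset."""
    return BLOSUM_SUBSET.get((a, b), BLOSUM_SUBSET.get((b, a)))


def affine_global_score(x, y, score, d=-4, e=-1):
    """Best global alignment score with affine gaps (d, e negative)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x[i] aligned to y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = d + i * e
    for j in range(1, m + 1):
        Iy[0][j] = d + j * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i - 1][j] + d + e,   # open a gap
                           Ix[i - 1][j] + e)      # extend it
            Iy[i][j] = max(M[i][j - 1] + d + e,
                           Iy[i][j - 1] + e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


print(affine_global_score("WFP", "FW", blosum_subset))
```

Keeping Ix and Iy separate from M is what lets the recurrence tell whether the next gap position opens a new gap or extends an existing one, which a single matrix cannot record.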

48 Global Alignment Example
[Figure: M, Ix and Iy matrices for ACACT and AAT]
Three optimal alignments:
    ACACT    ACACT    ACACT
    --AAT    A--AT    AA--T

49 Local Alignment DP for the Affine Gap Penalty Case
[Figure: recurrences for M, Ix and Iy; moving from M into Ix or Iy adds d+e, staying in Ix or Iy adds e]

50 Local Alignment DP for the Affine Gap Penalty Case
- Initialization [figure]
- Traceback
  - start at the largest M(i, j)
  - stop at M(i, j) = 0

51 Computational Complexity and Gap Penalty Functions
Assuming two sequences of length n:
- linear: $O(n^2)$
- affine: $O(n^2)$
- general: $O(n^3)$
- concave: $O(n^2 \log n)$

52 (Global) with General Gap Penalty Function
- consider every previous element in the row
- consider every previous element in the column

53 Finite State Automaton (FSA)
- In the theory of algorithms, such a system is called a finite state automaton
- An alignment corresponds to a path through the states of the automaton, and the symbols in the alignment are copied from the original sequences according to the state values

54

55 Additional (optional)

56 Semi-global Alignment
Example:
    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------

    CAGCACTTGGATTCTCGG
    CAGC----G--T----GG
We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

57 Global Alignment
Example:
    AAACCC
    A--CCC
Prefer to see:
    AAACCC
    --ACCC
Do not want to penalize the end spaces.
[Figure: DP matrix with columns empty, A, A, A, C, C, C and rows A, C, C, C]

58 SemiGlobal Alignment
Example:
    s = AAACCC
    t = --ACCC
[Figure: DP matrix with columns empty, A, A, A, C, C, C and rows A, C, C, C]

59 SemiGlobal Alignment
Example:
    s = AAACCCG
    t = --ACCC-
[Figure: DP matrix with columns empty, A, A, A, C, C, C, G and rows A, C, C, C]

60 SemiGlobal Alignment
- Summary of end space charging procedures:

    Place where spaces are not penalized    Action
    Beginning of 1st sequence               Initialize 1st row with zeros
    End of 1st sequence                     Look for max in last row
    Beginning of 2nd sequence               Initialize 1st column with zeros
    End of 2nd sequence                     Look for max in last column
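The four rules in the summary map directly onto four switches in a score-only sketch. It assumes the first sequence s runs along the rows and the second sequence t along the columns, and it uses the match/mismatch/gap values +1/-1/-2 from the earlier slides. The function and flag names are illustrative assumptions.

```python
def semiglobal_score(s, t,
                     free_begin_s=False, free_end_s=False,
                     free_begin_t=False, free_end_t=False,
                     match=1, mismatch=-1, gap=-2):
    """Alignment score in which selected end spaces are not charged."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        # Spaces before s: first row is zero when they are free.
        V[0][j] = 0 if free_begin_s else j * gap
    for i in range(1, n + 1):
        # Spaces before t: first column is zero when they are free.
        V[i][0] = 0 if free_begin_t else i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j] + gap,
                          V[i][j - 1] + gap,
                          V[i - 1][j - 1] + sub)
    candidates = [V[n][m]]
    if free_end_s:                       # spaces after s: max in last row
        candidates += V[n]
    if free_end_t:                       # spaces after t: max in last column
        candidates += [V[i][m] for i in range(n + 1)]
    return max(candidates)


print(semiglobal_score("AAACCC", "ACCC"))                      # fully global
print(semiglobal_score("AAACCC", "ACCC", free_begin_t=True))   # --ACCC
print(semiglobal_score("AAACCCG", "ACCC",
                       free_begin_t=True, free_end_t=True))    # --ACCC-
```

With all four flags off this reduces to the global score, and the two flagged calls correspond to the AAACCC and AAACCCG examples on the preceding slides.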