CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] 06-11-2006 Sequence Analysis Sequence Analysis Lecture 3 C E N T R F O R I N T E G R A T I V E B I O I N F O.

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Lecture 8 Alignment of pairs of sequence Local and global alignment
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky ( )) “Nothing in bioinformatics makes sense except in.
Where are we going? Remember the extended analogy? – Given binary code, what does the program do? – How does it work? At the end of the semester, I am.
Heuristic alignment algorithms and cost matrices
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] Sequence Analysis Alignments 2: Local alignment Sequence Analysis
Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)
1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment II Dynamic Programming
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Pairwise alignment Computational Genomics and Proteomics.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Local alignment
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Sequence Analysis Determining how similar 2 (or more) gene/protein sequences are (too each other) is a “staple” function in bioinformatics. This information.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Traceback and local alignment Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Pair-wise Sequence Alignment (II) Introduction to bioinformatics 2008 Lecture 6 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Chapter 3 Computational Molecular Biology Michael Smith
Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Step 3: Tools Database Searching
Introduction to bioinformatics Lecture 7 Multiple sequence alignment (1)
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
The ideal approach is simultaneous alignment and tree estimation.
Sequence comparison: Dynamic programming
Sequence comparison: Local alignment
Introduction to bioinformatics 2007
Introduction to bioinformatics 2007
Global, local, repeated and overlaping
Sequence Alignment 11/24/2018.
Pairwise sequence Alignment.
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment
Presentation transcript:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] Sequence Analysis Sequence Analysis Lecture 3 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 2 Local alignments

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [2] Sequence Analysis What can sequence alignment tell us about structure HSSP Sander & Schneider, 1991 ≥30% sequence identity

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [3] Sequence Analysis Pair-wise alignment Complexity of the problem Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~ alignments! T D W V T A L K T D W L - - I K

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [4] Sequence Analysis The algorithm for linear gap penalties M[i-1,j-1]+score(X[i],Y[j]) M[i,j]= max M[i,j-1]-g M[i-1,j]-g Corresponds to: X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j-1 - Value from residue exchange matrix i-1 i j-1 j

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [5] Sequence Analysis The algorithm for linear gap penalties M[i-1,j-1]+score(X[i],Y[j]) M[i,j]= max M[i,j-1]-g M[i-1,j]-g Corresponds to: X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j-1 - Value from residue exchange matrix i-1 i j-1 j How can we implement the affine gap penalty regime in this algorithm?

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [6] Sequence Analysis – Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: gp(k)=  k Affine: gp(k)=  +  k Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores over all alignment columns Dynamic programming Scoring alignments / Gap length penalty affine concave linear, General alignment score: S a,b =

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [7] Sequence Analysis Scoring and gap penalties Scoring: –DNA or protein? –Which evolutionary distance do we cover? Gap penalty: always less than 0 (i.e. they lower the alignment score) –Linear: g(k)=  k –Affine: g(k)=  +  k  harder to implement –Concave: g(k)=log(k)  no efficient solution

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [8] Sequence Analysis Global dynamic programming – general algorithm M[i-1,j-1] M[i,j] = score(X[i],Y[j]) + max max{M[0<x<i-1, j-1] - g open - (i-x-1)g extension } max{M[i-1, 0<x<j-1] - g open - (i-y-1)g extension } Value form residue exchange matrix i-1 i j-1 j Gap open penalty Gap extension penalty Number of gap extensions This more general way of dynamic programming also allows for affine or other gap penalty regimes

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [9] Sequence Analysis Easy DP recipe for using linear, affine and concave gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i,j]) Check –Cell[i-1, j-1]: apply score for cell[i-1, j-1] –preceding row until j-2: apply appropriate gap penalties –preceding column until i-2: apply appropriate gap penalties i-1 j-1

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [10] Sequence Analysis Note about gap penalties Some affine schemes use gap_penalty = -g open –g extension *(l-1), while others use gap_penalty = -g open –g extension *l, where l is the length of the gap. One can be converted into the other by adapting the g open penalty

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [11] Sequence Analysis Global dynamic programming g open =10, g extension =2 DWVTALK T D W V L K DWVTALK T D W V L K These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [12] Sequence Analysis Variation on global alignment Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-global alignment: don’t penalize for start/end gaps (omit the start/end of sequences). CAGCA-CTTGGATTCTCGG ---CAGCGTGG – Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other – Danger! – really bad alignments for divergent seqs seq X: seq Y:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [13] Sequence Analysis Global alignments - review Take two sequences: X[j] and Y[j] The best alignment for X[1…i] and Y[1…j] is called M[i, j] Initiation: M[0,0]=0 Apply the equation Find the alignment with backtracing X[j] Y[i] 2

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [14] Sequence Analysis Global and local alignment Local (Semi-)Global A B A B A B BA AB Local (Semi-) Global A B A C C A BC ABC A B A C C Problem:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [15] Sequence Analysis Local alignment What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally maximal: cannot make it better by trimming/extending the alignment Seq X: Seq Y:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [16] Sequence Analysis Local alignment Why local? –Parts of sequence diverge faster evolutionary pressure does not put constraints on the whole sequence –Proteins have modular construction sharing domains between sequences Seq X: Seq Y: seq X: seq Y: seq X: seq Y:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [17] Sequence Analysis Domains - example Immunoglobulin domain

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [18] Sequence Analysis Global  local alignment Take the old, good equation Look at the result of the global alignment q e s - euqes- CAGCACTTGGATTCTCG- CA-C-----GATTCGT-G a) global align b) retrieve the result c) sum score along the result align. pos. sum

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [19] Sequence Analysis Local alignment – breaking the alignment A recipe –Just don’t let the score go below 0 –Start the new alignment when it happens –Where is the result in the matrix? Before:After: sum align. pos. q e s 0 - euqes- q e 0 s 0 - euqes- sum align. pos. sum align. pos.

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [20] Sequence Analysis Local alignment – the equation Init the matrix with 0’s Fill in the search matrix (forward step) Read the maximal value from anywhere in the matrix Find the result by performing trace-back M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j]) max 0 Great contribution to science! q e s - euqes-

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [21] Sequence Analysis Finding second best alignment We can find the best local alignment in the sequence But where is the second best one? … 1613 A 1815 G A G A … …AGA… A clump Best alignment Scoring: 1 for match -2 for a gap

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [22] Sequence Analysis Clump of an alignment Alignments sharing at least one aligned pair A clump … A G 20 A 1916 G 18 A … …AGA…

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [23] Sequence Analysis Example: repeats proteins Local alignment traces showing similarity between repeats Note: repeats are similar motifs but need not be identical

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [24] Sequence Analysis Clumps gene Y gene X The figure shows two proteins with so- called tandem repeats (similar motifs adjacent in the sequence) ‘Shadows’ of various alignments are visible – these all derive from parts of a same locally optimal alignment

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [25] Sequence Analysis Finding second best alignment Don’t let any matched pair contribute to the next alignment Recalculate the clump “Clear” the best alignment

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [26] Sequence Analysis Clumps gene Y gene X The figure shows two proteins with so- called tandem repeats (similar motifs adjacent in the sequence) The highest scoring local alignment is indicated

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [27] Sequence Analysis Clumps gene Y gene X

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [28] Sequence Analysis Extraction of alignments – Waterman-Eggert algorithm 1. Repeat a.Retrieve the highest scoring alignment b.Set its trace to 0

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [29] Sequence Analysis General local DP algorithm for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell anywhere in matrix and do trace-back until zero-valued cell or start of sequence(s) i-1 j-1 Penalty = P open + gap_length*P extension S i,j = Max S i,j + Max{S 0<x<i-1,j-1 - Pi - (i-x-1)Px} S i,j + S i-1,j-1 S i,j + Max {S i-1,0<y<j-1 - Pi - (j-y-1)Px} 0

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [30] Sequence Analysis Example: general algorithm for local alignment DP algorithm with affine gap penalties (PAM250, Po=10, Pe=2) Extra start/end columns/rows not necessary (no end-gaps). Each negative scoring cell is set to zero. Highest scoring cell may be found anywhere in search matrix after calculating it. Trace highest scoring cell back to first cell with zero value (or the beginning of one or both sequences) DWVTALK T D W V L K DWVTALK T0003 D4000 W02100 V00259 L0011 K00 PAM250

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [31] Sequence Analysis Pitfalls of alignments  Alignment is often not a reconstruction of evolution (common ancestor is usually extinct by the time of alignment)  Repeats: matches to the same fragment seq X: seq Y:

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [32] Sequence Analysis Summary 1. Global e.g Needleman-Wunsch algorithm 2. Semi-global 3. Local e.g.Smith-Waterman algorithm 4. Many local alignments aka Waterman-Eggert algorithm What’s the number of steps in these algorithms? How much memory is used? seq X: seq Y: seq X: seq Y: seq X: seq Y: seq X: seq Y: These questions deal with the algorithmic complexity (time, memory) of alignment

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [33] Sequence Analysis Global alignment (“simple” recursive DP) Fill pre-row and pre- column with gap penalties to account for initial end gaps Perform trace-back from lower-right matrix cell, thereby accounting for terminal end gaps T G A - GTGAG Match/mismatch = 1/-1 Constant gap penalty = -2

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [34] Sequence Analysis Semi-global alignment Ignore initial end gaps –First row/column set to 0 Ignore terminal end gaps –Read the result from last row/column T G A - GTGAG Match/mismatch = 1/-1 Constant gap penalty = -2

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [35] Sequence Analysis Local alignment Ignore cells that score <= 0 Read the result (local alignment score) from the highest cell anywhere in the matrix –Perform trace-back starting from this cell. G G A - GTGAG Match/mismatch = 1/-1 Constant gap penalty = -2

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [36] Sequence Analysis Low complexity regions Local composition bias –Replication slippage: e.g. triplet repeats Results in spurious hits –Breaks down statistical models –Different proteins reported as hits due to similar composition –Up to ¼ of the sequence can be biased Widely used filtering program: SEGS (Wootton and Federhen, 1996) Huntington’s disease –Huntingtin gene of unknown (!) function –Repeats #: 6-35: normal; : disease

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [37] Sequence Analysis Synteny Synteny is preservation of gene order –5 to 20 million years of divergence Sequencing and comparison of yeast species to identify genes and regulatory elements, Manolis Kellis et.al, Nature 423, (15 May 2003) Arrows in figure represent genes; arrow direction indicates ‘+’ or ‘-’ strand of DNA Figure shows preservation of gene order in four yeast species

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [38] Sequence Analysis Genome vs. gene *) b stands for bases or nucleotides

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [39] Sequence Analysis Take-home messages (I) Know global, semi-global and local alignment (three types) Know three types of gap penalties ‘easy’ (fast) and general algorithm Pitfalls of global, semi-global and local alignment Low-complexity regions

CENTRFORINTEGRATIVE BIOINFORMATICSVU E [40] Sequence Analysis Make sure you understand and can carry out 1. the ‘simple’ DP algorithm (for linear gap penalties, but with extension to affine penalties) for global, semi-global and local alignment 2. The general DP algorithm for global, semi- global and local alignment! Take-home messages (II)