Centre for Integrative Bioinformatics VU, 02-11-2006. Sequence Analysis: Alignments 1.

Presentation transcript:


[2] Searching for similarities
What is the function of a new gene? The “lazy” investigation:
- Find a set of similar proteins
- Identify similarities and differences
- For long proteins: identify domains
Domains are structural units in a protein's tertiary structure and often provide a specific (sub)function to the complete protein.

[3] Is similarity really interesting?
Common ancestry is a very important observation: it makes it more likely that genes share the same function.
Homology: sharing a common ancestor
- A binary property (yes/no)
- A useful tool: when a known gene G is homologous to an unknown gene X, we gain a lot of information about X
[Figure: diagram relating Z, X and G]

[4] Functional and evolutionary
Evolutionary relation, reconstruction:
- Based on sequence
  - Identity (simplest method)
  - Similarity
- Homology (the ultimate goal)
- Other (e.g., 3D structure)
Functional relation: Sequence → Structure → Function (each determines the next)

[5] Evolution and 3D protein structure information
Isocitrate dehydrogenase: the distance from the active site (yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution).
Dean, A. M. and G. B. Golding, Pacific Symposium on Biocomputing 2000

[6] How to determine similarity?
Evolution at work. Frequent evolutionary events:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We'll use only these: substitution, insertion and deletion.
[Figure: common ancestor Z (usually extinct) with descendants X and Y (available)]

[7] Alignment
Mutations: substitution, insertion and deletion.
Which alignment is better? Use common sense and call it:
- Simplest
- Most probable
- Most likely (maximum likelihood)

[8] Scoring
Scoring should give reasonable alignments, and we have to assign scores to:
- Substitution (or match/mismatch)
  - DNA
  - Proteins
- Gap penalty, for a gap of length $k$
  - Linear: $g(k) = \alpha k$
  - Affine: $g(k) = \beta + \alpha k$
  - Concave, e.g.: $g(k) = \log(k)$
The score for an alignment is the sum of the scores of all alignment columns.
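The three gap penalty shapes on this slide can be compared side by side. The sketch below is only illustrative: the parameter names `slope` and `offset` and their default values are assumptions, not values from the slides.

```python
import math

def linear_gap(k, slope=2.0):
    """Linear penalty: every gap position costs the same amount."""
    return slope * k

def affine_gap(k, offset=3.0, slope=1.0):
    """Affine penalty: a fixed cost for having a gap plus a per-position cost."""
    return offset + slope * k

def concave_gap(k):
    """Concave penalty, here log(k): each extra position costs less than the last."""
    return math.log(k)

if __name__ == "__main__":
    # Print the penalty of gaps of length 1..5 under each scheme.
    for k in range(1, 6):
        print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 3))
```

With these assumed parameters a long gap is cheaper under the affine scheme than under the linear one, which is the usual reason for preferring affine penalties.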

[9] Substitution matrices
Define a score for the match/mismatch of letters.
DNA:
- Simple: [matrix not preserved in the transcript]
- Used in genome alignments: [matrix not preserved in the transcript]

[10] Substitution matrices for amino acids
Amino acids are not equal:
1. Some are easily substituted because they are similar in:
   - biochemical properties
   - structure
2. Some mutations occur more often due to similar codons
Together, these two effects give us substitution matrices.

[11] Blosum62 matrix
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = , Expected =
Columns: A R N D C Q E G H I L K M F P S T W Y V
Rows: A R N D C Q E G H I L K M F P S T W Y V
[Matrix values not preserved in the transcript]

[12] Linear vs. affine scoring
Two alignments of the same two sequences, scored with +1 for a match:

          Alignment 1                  Alignment 2
Seq1      G  T  A  -  -                G  -  T  -  A
Seq2      -  -  A  T  G                -  A  T  G  -
Linear   -2 -2  1 -2 -2  (SUM=-7)     -2 -2  1 -2 -2  (SUM=-7)
Affine   -3 -1  1 -3 -1  (SUM=-7)     -3 -3  1 -3 -3  (SUM=-11)

Gap scoring   Introduction   Extension
Linear        -2             -2
Affine        -3             -1
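The column sums on this slide can be checked with a few lines of code. This is a minimal sketch: the function name `score_alignment` is an assumption, and the affine extension score of -1 is inferred from the column scores on the slide. A gap column counts as an extension only when the previous column had a gap in the same sequence.

```python
def score_alignment(seq1, seq2, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Sum the column scores of an alignment given as two equal-length strings."""
    total = 0
    previous_gap = None  # which sequence held a gap in the previous column
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            current_gap = "seq1" if a == "-" else "seq2"
            # Extending an existing gap is scored differently from opening one.
            total += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            total += match if a == b else mismatch
            previous_gap = None
    return total

if __name__ == "__main__":
    for s1, s2 in [("GTA--", "--ATG"), ("G-T-A", "-ATG-")]:
        linear = score_alignment(s1, s2, gap_open=-2, gap_extend=-2)
        affine = score_alignment(s1, s2, gap_open=-3, gap_extend=-1)
        print(s1, s2, "linear:", linear, "affine:", affine)
```

Under linear scoring both alignments tie, while affine scoring prefers the one whose gaps are grouped into fewer, longer runs.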

[13] The algorithm
Goal: find the maximal scoring alignment.
Scores: m for a match, -s for a mismatch, -g for an insertion/deletion.
Dynamic programming:
- Solve smaller subproblem(s)
- Iteratively extend the solution
The score of the best alignment of X[1…i] and Y[1…j] is called M[i, j].
[Figure: an alignment of X[1…i] with Y[1…j] that contains gaps]

[14] The algorithm
Goal: find the maximal scoring alignment.
Scores: m for a match, -s for a mismatch, -g for an insertion/deletion.
The score of the best alignment of X[1…i] and Y[1…j] is called M[i, j].
There are 3 ways to extend the alignment:
1. X[1…i-1] X[i] aligned with Y[1…j-1] Y[j]: M[i-1,j-1] + m (match) or M[i-1,j-1] - s (mismatch)
2. X[1…i] - aligned with Y[1…j-1] Y[j]: M[i,j-1] - g
3. X[1…i-1] X[i] aligned with Y[1…j] -: M[i-1,j] - g
M[i,j] is the best of these three values.

[15] The algorithm for linear gap penalties
$$M[i,j] = \max \begin{cases} M[i-1,j-1] + \mathrm{score}(X[i],Y[j]) \\ M[i,j-1] - g \\ M[i-1,j] - g \end{cases}$$
The three cases correspond to:
1. X[1…i-1] X[i] aligned with Y[1…j-1] Y[j]
2. X[1…i] - aligned with Y[1…j-1] Y[j]
3. X[1…i-1] X[i] aligned with Y[1…j] -
score(X[i],Y[j]) is the value from the residue exchange matrix.
[Figure: cells (i-1, j-1), (i, j-1) and (i-1, j) feeding cell (i, j)]

[16] Example: global alignment of two sequences
Align two DNA sequences:
- GAGTGA
- GAGGCGA (note the length difference)
Parameters of the algorithm:
- Match: score(A,A) = 1
- Mismatch: score(A,T) = -1
- Gap: g = 2
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$

[17] The algorithm. Step 1: init
Create the matrix.
Initialization:
- 0 at [0,0]
- Apply the equation…
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$
[Matrix: columns j labelled -, G, A, G, T, G, A; rows i labelled -, G, A, G, G, C, G, A (rows 1 to 7); cells still empty]

[18] The algorithm. Step 1: init
Initialization of the matrix:
- 0 at position [0,0]
- Fill in the first row using the “→” rule (M[i,j-1] - 2)
- Fill in the first column using the “↓” rule (M[i-1,j] - 2)
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2
   A   -4
   G   -6
   G   -8
   C  -10
   G  -12
   A  -14

[19] The algorithm. Step 2: fill in
Continue filling in the matrix, remembering from which cell each result comes (arrows).
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$
[Matrix: partially filled in; most cell values are not preserved in the transcript]
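The initialization and fill steps for this example (GAGTGA against GAGGCGA, match 1, mismatch -1, gap 2) could be sketched as follows. The function name `nw_matrix` and the choice to put the first argument on the rows are assumptions made for illustration.

```python
def nw_matrix(rows_seq, cols_seq, match=1, mismatch=-1, gap=2):
    """Fill the global alignment matrix M for a linear gap penalty."""
    n, m = len(rows_seq), len(cols_seq)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 1: init. The first row and column contain only gaps.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - gap
    # Step 2: fill in. Each cell takes the best of its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: align two letters
                          M[i][j - 1] - gap,     # left: gap in the row sequence
                          M[i - 1][j] - gap)     # up: gap in the column sequence
    return M

if __name__ == "__main__":
    M = nw_matrix("GAGGCGA", "GAGTGA")
    for row in M:
        print(" ".join(f"{v:4d}" for v in row))
    print("global alignment score:", M[-1][-1])
```

The score of the best global alignment ends up in the last cell, which is where the backtrace starts.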

[20] The algorithm. Step 2: fill in
We are done… Where's the result?
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$
[Matrix: completely filled in; cell values are not preserved in the transcript]

[21] The algorithm. Step 3: backtrace
- Start at the last cell of the matrix
- Go in the direction of the arrows
- Sometimes the value may be obtained from more than one cell (which one?)
$M[i,j] = \max\{\, M[i-1,j-1] \pm 1,\; M[i,j-1] - 2,\; M[i-1,j] - 2 \,\}$
[Matrix: filled in, with backtrace arrows; cell values are not preserved in the transcript]

[22] The algorithm. Step 3: backtrace
Extract the alignments:
a) GAGT-GA
   GAGGCGA
b) GA-GTGA
   GAGGCGA
[Matrix: filled in, with the backtrace paths a and b marked]
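A backtrace can be sketched without storing arrows, by re-checking at each cell which neighbour could have produced its value. The function name `needleman_wunsch` and the tie-breaking order (diagonal, then up, then left) are assumptions. When several cells qualify, a different order returns a different alignment with the same score.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -gap * i
    for j in range(1, m + 1):
        M[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] - gap, M[i - 1][j] - gap)

    # Backtrace: start at the last cell and walk back to [0,0].
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] - gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))

if __name__ == "__main__":
    score, ax, ay = needleman_wunsch("GAGTGA", "GAGGCGA")
    print(score)
    print(ax)
    print(ay)
```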

[23] Global dynamic programming – general algorithm
$$M[i,j] = \mathrm{score}(X[i],Y[j]) + \max \begin{cases} M[i-1,j-1] \\ \max_{0<x<i-1} \{ M[x,j-1] - g_{open} - (i-x-1)\, g_{extension} \} \\ \max_{0<y<j-1} \{ M[i-1,y] - g_{open} - (j-y-1)\, g_{extension} \} \end{cases}$$
- score(X[i],Y[j]): value from the residue exchange matrix
- $g_{open}$: gap open penalty
- $g_{extension}$: gap extension penalty
- $(i-x-1)$ and $(j-y-1)$: number of gap extensions
This more general way of dynamic programming also allows for affine or other gap penalties.
[Figure: cell (i, j) with the preceding row i-1 and column j-1 highlighted]
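One possible reading of this general recurrence is sketched below. Several details are assumptions, not taken from the slide: the function name, the virtual start cell M[0][0] = 0, the search ranges that include index 0 so that leading gaps are penalized, and a final step that charges end gaps from the last row and column. A gap of length l costs g_open + l * g_extension here.

```python
NEG_INF = float("-inf")

def general_global_score(x, y, score, g_open, g_ext):
    """Global alignment score where M[i][j] ends with x[i] aligned to y[j]."""
    def gap(length):
        return g_open + g_ext * length

    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0  # virtual start cell (assumption)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = M[i - 1][j - 1]                    # no gap before this pair
            for k in range(0, i - 1):                 # preceding column j-1
                best = max(best, M[k][j - 1] - gap(i - 1 - k))
            for k in range(0, j - 1):                 # preceding row i-1
                best = max(best, M[i - 1][k] - gap(j - 1 - k))
            M[i][j] = score(x[i - 1], y[j - 1]) + best
    # Final score: allow a penalized gap at the end of either sequence.
    final = M[n][m]
    for k in range(0, n):
        final = max(final, M[k][m] - gap(n - k))
    for k in range(0, m):
        final = max(final, M[n][k] - gap(m - k))
    return final

if __name__ == "__main__":
    def dna_score(a, b):
        return 1 if a == b else -1
    # With g_open = 0 and g_ext = 2 this is a linear penalty of 2 per gap position.
    print(general_global_score("GAGTGA", "GAGGCGA", dna_score, 0, 2))
```

The inner loops make this cubic in the sequence lengths, which is the price of supporting arbitrary gap penalty functions.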

[24] Easy DP recipe for using affine gap penalties
M[i,j] is the score of the optimal alignment (the highest scoring alignment up to [i,j]).
Check:
- Cell [i-1, j-1]: apply the score for cell [i-1, j-1]
- Preceding row until j-2: apply the appropriate gap penalties
- Preceding column until i-2: apply the appropriate gap penalties
[Figure: cell (i, j) with row i-1 and column j-1 highlighted]

[25] Note about gap penalties
Some affine schemes use $\text{gap\_penalty} = -g_{open} - g_{extension}(l-1)$, while others use $\text{gap\_penalty} = -g_{open} - g_{extension}\, l$, where $l$ is the length of the gap.
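The two conventions describe the same family of penalties and differ only in what the opening term includes. As a worked illustration, take a gap of length 3 with opening 10 and extension 2 (the length and the pairing with these values are assumptions chosen for the example):

```latex
\begin{aligned}
-g_{open} - g_{extension}(l-1) &= -(g_{open} - g_{extension}) - g_{extension}\, l \\
\text{first convention, } l = 3:\quad & -10 - 2 \cdot (3-1) = -14 \\
\text{second convention, } l = 3:\quad & -10 - 2 \cdot 3 = -16
\end{aligned}
```

So parameters quoted for one program may need the opening penalty shifted by one extension penalty before they are comparable with another.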

[26] Global dynamic programming
$g_{open} = 10$, $g_{extension} = 2$
[Two matrices for DWVTALK (columns) against TDWVLK (rows): the residue exchange values and the filled-in dynamic programming matrix; cell values are not preserved in the transcript]
These values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
The extra bottom row and rightmost column give the final global alignment scores.

[27] Variation on global alignment
Global alignment: the previous algorithm is called global alignment because it uses all letters from both sequences.
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
Semi-global alignment: don't penalize start/end gaps (omit the start/end of the sequences).
CAGCA-CTTGGATTCTCGG
---CAGCGTGG
Applications of semi-global alignment:
- Finding a gene in a genome
- Placing a marker onto a chromosome
- One sequence much longer than the other
Danger: really bad alignments for divergent sequences.
[Figure: seq X and seq Y]
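One common way to make start and end gaps free is to initialize the first row and column to 0 and to read the result from the best cell in the last row or last column. The sketch below assumes that variant; the function name and the match/mismatch/gap values are illustrative assumptions.

```python
def semi_global_score(x, y, match=1, mismatch=-1, gap=2):
    """Alignment score with unpenalized gaps at the start and end of both sequences."""
    n, m = len(x), len(y)
    # The first row and column stay 0, so leading gaps cost nothing.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] - gap, M[i - 1][j] - gap)
    # Trailing gaps are free too: take the best value in the last row or column.
    return max(max(M[n]), max(row[m] for row in M))

if __name__ == "__main__":
    # A short sequence embedded in a longer one scores as a perfect match.
    print(semi_global_score("ACGT", "TTACGTTT"))
```

Because unrelated sequences can always fall back on an alignment made entirely of free end gaps, a low score here says little, which fits the warning about divergent sequences.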

[28] Take-home message
- Homology
- Why are we interested in similarity?
- Pairwise alignment: global alignment