CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

Slides:



Advertisements
Similar presentations
Substitution matrices
Advertisements

Global Sequence Alignment by Dynamic Programming.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Measuring the degree of similarity: PAM and blosum Matrix
1 ALIGNMENT OF NUCLEOTIDE & AMINO-ACID SEQUENCES.
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
S. Maarschalkerweerd & A. Tjhang1 Probability Theory and Basic Alignment of String Sequences Chapter
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Introduction to Bioinformatics Algorithms Sequence Alignment.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Sequence analysis of nucleic acids and proteins: part 1 Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
CISC667, F05, Lec6, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment Smith-Waterman (local alignment)
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Local alignment
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Reuters Another hat tossed into the million genome ring (so to speak)…
Pairwise & Multiple sequence alignments
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
DNA, RNA and protein are an alien language
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Reuters Another hat tossed into the million genome ring (so to speak)…
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Sequence comparison: Local alignment
Alignment IV BLOSUM Matrices
Presentation transcript:

CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

CISC667, S07, Lec5, Liao Sequence Alignment Motivation –Sequence assembly: reconstructing long DNA sequences from overlapping sequence fragments –Annotation: assign functions to newly discovered genes Raw genomic (DNA) sequences  coding sequences (CDS), candidate for genes  protein sequence  function Terminologies: cDNA, RNA, mRNA Evolution: mutation  sequence diversity (versus homology)  (new) phenotype ? Basis for annotation: sequence similarity  sequence homology  same function –Caveat: homology can only be inferred, not affirmed, since we can not rewind to see how evolution actually happened.

CISC667, S07, Lec5, Liao Ancestral sequence: ACGTACGT After 9540 generations (del: , ins: 0.001, trans_mut: , transv_mut: ) Sequence1: ACACGGTCCTAATAATGGCC Sequence2: CAGGAAGATCTTAGTTC True history: --ACG-T-A---CG-T---- ACACGGTCCTAATAATGGCC ---AC-GTA-C--G-T-- CAG-GAAGATCTTAGTTC Alignment that reflects the true history: Seq1: -ACAC-GGTCCTAAT--AATGGCC Seq2: CAG-GAA-G-AT--CTTAGTTC--

CISC667, S07, Lec5, Liao Alignment algorithms What is an alignment? A one-to-one matching of two sequences so that each character in a pair of sequences is associated with a single character of the other sequence or with a null character (gap). Alignments are often displayed as two rows with an optional third row in between pointing out regions of similarity. Example: Types of alignment: –pairwise vs multiple; –global vs local Algorithms –Rigorous –heuristic

CISC667, S07, Lec5, Liao Substitution Score matrix Alignments are used to reveal homologous proteins/genes Substitution scores are used to assess how good the alignments of a pair of residues are. Under the assumption that each mutation (i.e., deletion, insertion, and substitution) is independent, the total score of an alignment is the sum of scores at each position. Substitution score matrix is a 20 x 20 matrix that gives the score for every pair of amino acids. The ways to derive a substitution score matrix. –Ad hoc –Physical/chemical properties of amino acids –Statistical

CISC667, S07, Lec5, Liao PAM matrices (Margaret Dayhoff, 1978) point accepted mutation or percent accepted mutation unit of measurement of evolutionary divergence between two amino acid sequences substitute matrices (scoring matrices) 1 PAM = one accepted point-mutation event per one- hundred amino acids

CISC667, S07, Lec5, Liao PAM (cont’d) caveat: Sequences s1 and s2 are x PAM divergent does not imply s1 and s2 have x percent sequence difference ( should be equal or less). facts: 1.even amino acid sequences that have diverged by 200 PAM units are expected to be identical in about 25% of their positions. 2.sequences that are 250 PAM units diverged can generally be distinguished from a pair of random sequences.

CISC667, S07, Lec5, Liao PAM matrix is a 20 by 20 matrix, and each element p ij represents the expected evolutionary exchange between the two corresponding amino acids for sequences that are a specific number of PAM units diverged. That is, p ij = log[f(i,j)/f(i)f(j)] where f(i) and f(j) are the frequencies that amino acids A i and A j appear in the sequences, and f(i,j) the frequency that A i and A j are aligned. By construction, PAM n matrices with n > 1 are extrapolated from those with lower n. Namely, it assumes that the frequencies of the amino acids remain constant over time and that the mutational processes causing replacements in an interval of one PAM unit operate in the same way for longer periods.

CISC667, S07, Lec5, Liao Ideally, to align two sequences that are x PAM units diverged, the PAM x matrix should be used. Practically, since you do not know a priori how much diverged given two sequences, you may need to try different PAM matrices. Empirically, PAM 250 matrix is reported to be very effective for aligning sequences used in evolutionary studies.

CISC667, S07, Lec5, Liao

BLOSUM matrices [Steven and Jorja Henikoff] - BLOSUM x matrix is a 20 by 20 matrix. Its elements are defined like those of PAM matrices but the frequencies are collected from sequences in BLOCKs database that are less than x percent identical (generally x is between 50 and 80). - By their construction, BLOSUM matrices are believed to be more effectively detect distant homology. - Taking the place of PAM 250, BLOSUM 62 is now the default matrix used in database search.

CISC667, S07, Lec5, Liao BLOSUM50

CISC667, S07, Lec5, Liao Needleman-Wunsch algorithm (Global Pairwise optimal alignment, 1970) To align two sequences x[1...n] and y[1...m], i)if x at i aligns with y at j, a score s(x i, y j ) is added; if either x i or y j is a gap, a score of d is subtracted (penalty). ii) The best score up to (i,j) will be F(i,j) = max { F(i-1, j-1) + s(x i, y j ), F(i-1,j)  d, // gap in y F(i, j-1)  d // gap in x }

CISC667, S07, Lec5, Liao Needleman-Wunsch (cont’d) iii) Tabular computing to get F(i,j) for all 1<i<n and i<j<m Draw a diagram: F(i-1, j-1) F(i, j-1) s(x i,y j ) -d F(i-1, j) F(i,j) By definition, F(n,m) gives the best score for an alignment of x[1...n] and y[1...m]. -d

CISC667, S07, Lec5, Liao iv) Trace-back To find the alignment itself, we must find the path of choices (in applying the formulae of ii) when tabular computing that led to this final value. > Vertical move is gap in the column sequence. > Horizontal move is gap in the row sequence. > Diagonal move is a match.

CISC667, S07, Lec5, Liao Example: Align HEAGAWGHEE and PAWHEAE. Use BLOSUM 50 for substitution matrix and d=-8 for gap penalty. HEAGAWGHE-E --P-AW-HEAE

CISC667, S07, Lec5, Liao Time complexity: O(nm) Space complexity: O(nm) Big-O notation: f(x) = O(g(x)) => f is upper bound by g f(x) =  (g(x)) => f is lower bound by g f(x) =  (g(x)) => f is bound to g within constant factors