

Sequence Alignment - III Chitta Baral

Scoring Model
When comparing sequences
– We look for evidence that they have diverged from a common ancestor by a process of mutation and selection.
Basic mutational processes
– Substitutions
– Insertions and deletions (together referred to as gaps)
Total score: the sum of a term for each aligned pair plus a term for each gap
– Corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated.
– Identities and conservative substitutions are expected to be more likely in real alignments than by chance: they contribute positive score terms.
– Non-conservative changes are observed less frequently in real alignments than expected by chance: they contribute negative score terms.
– Additive scoring scheme: based on the assumption that mutations at different sites in a sequence occur independently.
  Reasonable for DNA and protein sequences
  Inaccurate for structural RNAs
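The additive scheme above can be sketched in a few lines of Python. This is only an illustration: the function names, the match/mismatch values (+5 / -4), and the per-symbol gap cost (7) are hypothetical choices, not values taken from the slides.

```python
def alignment_score(row_a, row_b, s, d):
    """Additive score of a given alignment.

    row_a, row_b: the two aligned rows, same length, with '-' marking a gap.
    s: function giving the score s(a, b) of an aligned pair of symbols.
    d: cost charged for every gap symbol (linear gap penalty, assumed here).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total -= d          # gap term
        else:
            total += s(a, b)    # substitution (or identity) term
    return total


def toy_score(a, b):
    """Hypothetical scores: +5 for an identity, -4 for any substitution."""
    return 5 if a == b else -4


# Five identities, one substitution (A/C), one gap symbol.
example_total = alignment_score("GATTACA", "GC-TACA", toy_score, d=7)
print(example_total)
```

Because each column contributes independently, the total is just a sum over columns, which is what makes dynamic programming applicable later.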

Substitution Matrices
Notation: a pair of sequences $x[1..n]$ and $y[1..m]$
– Let $x_i$ be the $i$th symbol in $x$
– And $y_j$ be the $j$th symbol in $y$
– Let $p_{ab}$ be the probability that symbols $a$ and $b$ occur as an aligned pair in related sequences (so $p_{x_i y_i}$ is the probability of the aligned pair $x_i$, $y_i$)
– Let $q_a$ be the probability that we have symbol $a$ by chance, i.e. the frequency of occurrence of $a$
Score (for an ungapped alignment, so $n = m$):
$$S = \log \frac{P(x, y \mid \text{related})}{P(x, y \mid \text{unrelated})}$$
$$P(x, y \mid \text{related}) = p_{x_1 y_1}\, p_{x_2 y_2} \cdots$$
$$P(x, y \mid \text{unrelated}) = q_{x_1} q_{x_2} \cdots \times q_{y_1} q_{y_2} \cdots$$
Odds ratio:
$$\frac{p_{x_1 y_1}}{q_{x_1} q_{y_1}} \times \frac{p_{x_2 y_2}}{q_{x_2} q_{y_2}} \times \cdots$$
Log-odds ratio:
$$S = s(x_1, y_1) + s(x_2, y_2) + \cdots$$
– Where $s(a,b) = \log \dfrac{p_{ab}}{q_a q_b}$
– The $s(a,b)$ table is known as the score matrix or substitution matrix
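As a small sanity check of the log-odds construction, the sketch below builds a score matrix from made-up probabilities over a two-letter alphabet and confirms that the summed scores equal the log of the odds ratio. The alphabet, the values of `p` and `q`, and the function names are all hypothetical.

```python
import math


def log_odds_matrix(p, q):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


# Hypothetical background frequencies and aligned-pair probabilities.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

s = log_odds_matrix(p, q)

# Ungapped alignment of x and y, column by column.
x, y = "AAB", "ABB"
score = sum(s[(a, b)] for a, b in zip(x, y))

# The same quantity computed directly from the two likelihoods.
p_related = math.prod(p[(a, b)] for a, b in zip(x, y))
p_unrelated = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
direct = math.log(p_related / p_unrelated)

print(score, direct)
```

With these numbers identities score positive (0.4 > 0.25) and substitutions score negative (0.1 < 0.25), matching the sign convention described in the scoring model.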

Gap Penalties
Also based on a probabilistic model of alignment
– Less widely recognized than the probabilistic basis of substitution matrices
Gap of length $g$ due to insertion of $a_1 \ldots a_g$
– $P(\text{gap because of mutation}) = f(g)\, q_{a_1} \cdots q_{a_g}$
– $P(\text{having } a_1 \ldots a_g \text{ by chance}) = q_{a_1} \cdots q_{a_g}$
– Ratio $= f(g)$
– Log of ratio $= \log f(g)$
– Geometric distribution: $f(g) = k \exp(-cg)$ for constants $k$ and $c$
– Suppose $f(g) = \exp(-gd)$; then log of ratio $= -gd$ (linear score)
– Suppose $f(g) = k \exp(-ge)$, where $e$ denotes the gap-extension parameter rather than Euler's number; then
$$\log f(g) = -ge + \log k = -ge + e + (\log k - e) = -(e - \log k) - (g-1)e = -d - (g-1)e$$
  where $d = e - \log k$ (affine score)
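The identity behind the affine score can be checked numerically. The sketch below is illustrative only: the function names and the values of `k` and `e` are hypothetical, and `e` is the gap-extension parameter, not Euler's number.

```python
import math


def linear_gap(g, d):
    """Linear gap score: -g * d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: -d for opening the gap, -e for each further position."""
    return -d - (g - 1) * e


def log_gap_probability(g, k, e):
    """log f(g) for the assumed gap-length model f(g) = k * exp(-g * e)."""
    return math.log(k * math.exp(-g * e))


k, e = 0.1, 0.5            # hypothetical model parameters
d = e - math.log(k)        # gap-open penalty implied by the derivation above

for g in range(1, 6):
    print(g, affine_gap(g, d, e), log_gap_probability(g, k, e))
```

For every gap length the two columns agree, which is the statement that an affine penalty with $d = e - \log k$ is the log-probability of a gap under this length model. Setting $k = 1$ gives $d = e$ and recovers the linear score.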

Repeated matches
A big string $x[1..n]$ and a smaller string $y[1..m]$
Asymmetric: we look for multiple matches of $y$ in $x$.
As we do the matching and fill the table, we need to decide when to stop going further in $y$ and start over from the beginning of $y$.
$F(i,0)$: assuming $x_i$ is in an unmatched region, the best total score so far.
$F(i,j)$, $j \ge 1$: assuming $x_i$ is in a matched region and the last match ends at $x_i$ and $y_j$, the best total score so far.
$F(0,0) = 0$.
$$F(i,j) = \max \{\, F(i,0),\; F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$$
– $F(i,0)$ corresponds to the start-over option (but now we store the total score so far)
$$F(i,0) = \max \{\, F(i-1,0),\; F(i-1,j) - T \ \text{for}\ j = 1, \ldots, m \,\}$$
– $T$ is a threshold: we are only interested in matches scoring higher than $T$. (Important, because there are always short local alignments with small positive scores even between entirely unrelated sequences.)
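A minimal sketch of filling the table with these two recurrences follows. Several details are assumptions rather than content from the slides: row 0 is initialised to 0 for every $j$, the final total is obtained by applying the $F(i,0)$ rule once more past the end of $x$ so that a match ending at $x_n$ is counted, and the scoring function, gap cost, and threshold in the example are hypothetical. Traceback is omitted.

```python
def repeated_matches(x, y, s, d, T):
    """Fill the repeated-match table F and return (F, total score).

    F[i][0]: best total so far with x[i-1] in an unmatched region.
    F[i][j]: best total so far with a match ending at x[i-1], y[j-1].
    """
    n, m = len(x), len(y)
    if m == 0:
        raise ValueError("y must be non-empty")
    F = [[0] * (m + 1) for _ in range(n + 1)]   # assumed: row 0 is all zeros
    for i in range(1, n + 1):
        # Stay unmatched, or close a match from the previous column,
        # paying the threshold T for it.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][j] - T for j in range(1, m + 1)))
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                    # start over
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),    # aligned pair
                F[i - 1][j] - d,                            # gap
                F[i][j - 1] - d,                            # gap
            )
    # Assumed final step: apply the F(i,0) rule one position past the end of x.
    total = max(F[n][0], max(F[n][j] - T for j in range(1, m + 1)))
    return F, total


def toy_s(a, b):
    """Hypothetical scores: +2 for an identity, -1 for a substitution."""
    return 2 if a == b else -1


F, total = repeated_matches("ABCCAB", "AB", toy_s, d=2, T=3)
print(total)
```

In this toy run each copy of `AB` in `x` scores 4, and each contributes 4 - T = 1 to the total once the match is closed, so two matches are rewarded while anything scoring at or below the threshold adds nothing.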

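Illustration of repeated matches
[Figure: dynamic programming table for the sequences HEAGAWGHEE and PAWHEAE; the cell values did not survive in the transcript.]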

Next
– Alignment with affine gap scores
– Heuristic-based approaches