Learning to Align: a Statistical Approach

Learning to Align: a Statistical Approach IDA 2007 Elisa Ricci, University of Perugia (IT) Tijl De Bie, Nello Cristianini, University of Bristol (UK)

Outline
Sequence alignment
Z-score as a function of the alignment parameters
Z-score computation by dynamic programming
Inverse Parametric Sequence Alignment Problem (IPSAP)
Z-score maximization to solve IPSAP
Experimental results: artificial data; PALI dataset of protein structure alignments
Conclusions

Sequence Alignment
Definition: Given two sequences $S_1$ and $S_2$, a global alignment is an assignment of gaps that lines up each letter in one sequence with either a gap or a letter in the other sequence. It is used to determine the similarity between biological sequences.
Example: $\Sigma = \{A, T, G, C\}$, $S_1, S_2 \in \Sigma^*$, with $S_1$ = ATGCTTTC and $S_2$ = CTGTCGCC. One possible alignment A:
ATGCTTTC---
---CTGTCGCC

Sequence Alignment
Score of the alignment: a linear function of the parameters.
3-parameter model: matches are rewarded with $a_m$, mismatches are penalized by $a_s$, and gaps are weighted by $a_g$:
$f(S_1, S_2, A) = a_m m + a_s s + a_g g = a^T x$
with $x^T = [m\ s\ g]$ = [#matches #mismatches #gaps] and $a^T = [a_m\ a_s\ a_g]$.
Example: for the alignment A shown here, $f(S_1, S_2, A) = 4a_m + a_s + 6a_g$.
ATGCTTTC---
---CTGTCGCC
But how do we determine the optimal alignment? A scoring scheme must be produced.
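The feature counts in the example can be checked mechanically. The sketch below is illustrative only: the function names and the numeric parameter values are assumptions, not taken from the slides. It counts matches, mismatches and gaps column by column and evaluates the linear score $a^T x$.

```python
def alignment_features(row1, row2, gap="-"):
    """Count (matches, mismatches, gaps) column by column in a gapped alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    m = s = g = 0
    for c1, c2 in zip(row1, row2):
        if c1 == gap and c2 == gap:
            raise ValueError("a column cannot pair two gaps")
        if c1 == gap or c2 == gap:
            g += 1
        elif c1 == c2:
            m += 1
        else:
            s += 1
    return m, s, g


def alignment_score(a, x):
    """Linear score a^T x for a = (a_m, a_s, a_g) and x = (m, s, g)."""
    return sum(ai * xi for ai, xi in zip(a, x))


x = alignment_features("ATGCTTTC---", "---CTGTCGCC")
print(x)  # (4, 1, 6)
# Hypothetical parameter values, chosen only for illustration.
print(alignment_score((1.0, -1.0, -2.0), x))  # -9.0
```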

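Sequence Alignment
4-parameter model: an affine function for gap penalties, i.e. a different cost when a gap starts at a given position (gap opening penalty $a_o$) and when it continues (gap extension penalty $a_e$).
211/212-parameter model: gap penalties plus a symmetric scoring matrix with elements $a_{yt}$, $y, t \in \Sigma$, $\Sigma$ = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}. Symmetry means, for example, that C paired with D scores the same as D paired with C: $a_{CD} = a_{DC}$. The 20 amino acids give 20·21/2 = 210 distinct matrix entries, plus one gap parameter (211) or two affine gap parameters (212).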
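For the 4-parameter model, the feature vector needs gap openings and gap extensions counted separately. The following is a minimal sketch under one assumed convention: the first column of a gap run counts as an opening, every further column of the same run as an extension, and a gap switching from one sequence to the other opens a new run. The function name is hypothetical.

```python
def affine_features(row1, row2, gap="-"):
    """Return (matches, mismatches, gap openings, gap extensions).

    Assumed convention: the first column of a gap run is an opening,
    each further column of the same run is an extension.
    """
    m = s = o = e = 0
    prev = None  # which row carried the gap in the previous column (1, 2 or None)
    for c1, c2 in zip(row1, row2):
        if c1 == gap or c2 == gap:
            cur = 1 if c1 == gap else 2
            if cur == prev:
                e += 1
            else:
                o += 1
            prev = cur
        else:
            prev = None
            if c1 == c2:
                m += 1
            else:
                s += 1
    return m, s, o, e


print(affine_features("ATGCTTTC---", "---CTGTCGCC"))  # (4, 1, 2, 4)
```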

Sequence Alignment
Optimal alignment: the alignment with the highest score. Optimality depends on the parameters a.
The number of possible alignments N is exponential in the length of the sequences.
The optimal alignment is computed using dynamic programming (DP) in O(nm) time for sequences of lengths n and m [Needleman-Wunsch, 1970].
Alignments can be represented as paths from the upper-left to the lower-right corner of the alignment graph.
(Figure: alignment graph with the letters of one sequence along each axis.)
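A minimal sketch of the O(nm) dynamic program for the 3-parameter model follows. The default parameter values are hypothetical, and only the optimal score is returned (no traceback), so this illustrates the recurrence rather than a full aligner.

```python
def nw_score(s1, s2, a_m=1.0, a_s=-1.0, a_g=-2.0):
    """Optimal global alignment score under the 3-parameter model."""
    n, m = len(s1), len(s2)
    # Row 0: the empty prefix of s1 aligned to s2[:j] needs j gaps.
    prev = [j * a_g for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * a_g] + [0.0] * m
        for j in range(1, m + 1):
            sub = a_m if s1[i - 1] == s2[j - 1] else a_s
            cur[j] = max(prev[j - 1] + sub,  # pair s1[i-1] with s2[j-1]
                         prev[j] + a_g,      # s1[i-1] against a gap
                         cur[j - 1] + a_g)   # s2[j-1] against a gap
        prev = cur
    return prev[m]


print(nw_score("ATGCTTTC", "CTGTCGCC"))
```

Only two rows of the table are kept in memory, which is enough when the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.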

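Moments of the scores
The mean and the variance of the scores can be expressed as functions of the parameters a.
Example: For the 3-parameter model, with $\mu_x = [\mu_m\ \mu_s\ \mu_g]^T$ the mean and $C$ the covariance matrix of the feature vector x over all alignments, linearity of the score gives
$\mu(S_1, S_2) = a^T \mu_x = a_m \mu_m + a_s \mu_s + a_g \mu_g$ and $\sigma^2(S_1, S_2) = a^T C a$.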

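The Z-score
Definition: Let $\mu(S_1, S_2)$ and $\sigma^2(S_1, S_2)$ be the average score and the variance of the scores over all possible alignments between $S_1$ and $S_2$. Let $\bar{A}$ be the optimal alignment between $S_1$ and $S_2$ for a given a, and $\bar{x}$ the associated feature vector. We define the Z-score
$Z(S_1, S_2) = \frac{a^T \bar{x} - \mu(S_1, S_2)}{\sigma(S_1, S_2)}$,
where $\sigma(S_1, S_2) = \sqrt{\sigma^2(S_1, S_2)}$.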
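For very short sequences the definition can be evaluated directly by enumerating every path in the alignment graph. The sketch below does exactly that for the 3-parameter model. It is a brute-force illustration whose cost grows exponentially with sequence length, not the DP method of the talk, and the function names and parameter values are assumptions.

```python
from math import sqrt


def all_feature_vectors(s1, s2):
    """Yield (m, s, g) for every path in the alignment graph (exponentially many)."""
    def walk(i, j, m, s, g):
        if i == len(s1) and j == len(s2):
            yield (m, s, g)
            return
        if i < len(s1) and j < len(s2):
            if s1[i] == s2[j]:
                yield from walk(i + 1, j + 1, m + 1, s, g)
            else:
                yield from walk(i + 1, j + 1, m, s + 1, g)
        if i < len(s1):
            yield from walk(i + 1, j, m, s, g + 1)  # letter of s1 against a gap
        if j < len(s2):
            yield from walk(i, j + 1, m, s, g + 1)  # letter of s2 against a gap
    yield from walk(0, 0, 0, 0, 0)


def z_score(s1, s2, a):
    """(best score - mean score) / standard deviation over all alignments.

    Raises ZeroDivisionError if every alignment has the same score.
    """
    scores = [sum(ai * xi for ai, xi in zip(a, x))
              for x in all_feature_vectors(s1, s2)]
    mu = sum(scores) / len(scores)
    var = sum((f - mu) ** 2 for f in scores) / len(scores)
    return (max(scores) - mu) / sqrt(var)


# Hypothetical parameter values, for illustration only.
print(z_score("ATG", "CTG", (1.0, -1.0, -2.0)))
```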

Computing the Z-score
Given a parameter vector a, the Z-score can be computed with DP for the 3-, 4-, 211- and 212-parameter models.
Example: For the 3-parameter model, 9 DP routines are required.
(Figure: example DP table over the letters A, T, G, C, with cell values 0.11, 0.23, 0.12, 0.09, 0.34, 0.22.)

Computing the Z-score
2 DP tables: $p$ and $\mu_m$.
Inductive assumption: $p(i, j-1)$, $p(i-1, j)$, $p(i-1, j-1)$ are the correct numbers of alignments of the corresponding prefixes, and $\mu_m(i, j-1)$, $\mu_m(i-1, j)$, $\mu_m(i-1, j-1)$ are the correct mean values.
Each cell is filled with the following rules:
$p(i, j) = p(i-1, j-1) + p(i, j-1) + p(i-1, j)$
$\mu_m(i, j)\, p(i, j) = \mu_m(i-1, j-1)\, p(i-1, j-1) + M\, p(i-1, j-1) + \mu_m(i, j-1)\, p(i, j-1) + \mu_m(i-1, j)\, p(i-1, j)$
where $M = 1$ if $S_1(i) = S_2(j)$ and $M = 0$ if $S_1(i) \neq S_2(j)$.
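The two recurrences translate directly into code. The sketch below fills both tables for the mean number of matches. The boundary conditions (one all-gap alignment and zero matches along the first row and column) are an assumption not stated on the slide, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def count_and_mean_matches(s1, s2):
    """Return (number of alignments, mean number of matches) over all alignments."""
    n, m = len(s1), len(s2)
    # Assumed boundary: a prefix aligned to the empty string has exactly one
    # alignment (all gaps) and therefore zero matches.
    p = [[1] * (m + 1) for _ in range(n + 1)]
    mu = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M = 1 if s1[i - 1] == s2[j - 1] else 0
            p[i][j] = p[i - 1][j - 1] + p[i][j - 1] + p[i - 1][j]
            total = ((mu[i - 1][j - 1] + M) * p[i - 1][j - 1]
                     + mu[i][j - 1] * p[i][j - 1]
                     + mu[i - 1][j] * p[i - 1][j])
            mu[i][j] = total / p[i][j]
    return p[n][m], mu[n][m]


print(count_and_mean_matches("A", "A"))  # (3, Fraction(1, 3))
```

For the pair "A", "A" there are three paths (one diagonal step, or the two orders of two gap steps), and only the diagonal one contains a match, which gives the mean of 1/3.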

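Computing the Z-score
Basic principle: Mean values: Variances are computed by centering the second-order moments, e.g. $\sigma_m^2 = E[m^2] - \mu_m^2$ and, for the covariances, $c_{ms} = E[ms] - \mu_m \mu_s$, with expectations taken over all alignments.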

IPSAP
Inverse Parametric Sequence Alignment Problem (IPSAP): given a training set of pairwise global alignments, learn the parameters a such that the given alignments have the best scores among all possible alignments.
Training set: $T = \{(S_1^i, S_2^i, \bar{A}^i)\}_{i=1}^{\ell}$.
Find a s.t. $f(S_1^i, S_2^i, \bar{A}^i) \geq f(S_1^i, S_2^i, A)$ for every alignment $A$ of $S_1^i, S_2^i$ and every $i = 1, \dots, \ell$.
This gives an exponential number of linear constraints.
Iterative approaches: linear programming [Kececioglu and Kim 06], max margin [Joachims et al. 05].

Z-score maximization
Idea: a global objective function, more naturally suited to non-separable cases.
Z-score maximization: minimize the number of alignments with a score higher than the given one.
(Figure annotated with $\mu$ and $\sigma$.)

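Z-score maximization
Z-score of a training set: $Z(a) = \frac{a^T b^*}{\sqrt{a^T C^* a}}$, with $b^* = \sum_i b_i$ and $C^* = \sum_i C_i$, where $b_i = \bar{x}_i - \mu_{x,i}$ is the centered feature vector of the i-th given alignment and $C_i$ the covariance matrix of the features for the i-th pair.
Convex optimization (QP): $\min_a a^T C^* a$ s.t. $a^T b^* \geq 1$.
Most of the linear IPSAP constraints are already satisfied by the solution.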
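A short derivation shows why the unconstrained problem is a simple one. It assumes the training-set Z-score has the ratio form built from the summed moments $b^*$ and $C^*$, and that $C^*$ is positive definite. If $C^*$ is singular, which can happen when the features are linearly dependent, a pseudo-inverse or some regularisation would be needed instead.

```latex
\begin{aligned}
Z(a) &= \frac{a^{T} b^{*}}{\sqrt{a^{T} C^{*} a}},
  \qquad Z(c\,a) = Z(a) \ \text{for all } c > 0, \\
\max_{a} Z(a) \ &\Longleftrightarrow\
  \min_{a}\ a^{T} C^{*} a \quad \text{s.t.} \quad a^{T} b^{*} \geq 1
  \quad \text{(constraint active at the optimum)}, \\
2\, C^{*} a &= \lambda\, b^{*}
  \ \Longrightarrow\ a \propto (C^{*})^{-1} b^{*}.
\end{aligned}
```

The scale invariance in the first line is what allows the numerator to be fixed, turning the ratio into a quadratic objective with one linear constraint.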

Iterative algorithm
Explicitly impose the violated constraints.
This is again a convex optimization problem, solved by an iterative algorithm.
If necessary, relax the constraints (e.g. add slack variables for non-separable problems).

Iterative algorithm
INPUT: training set T
C ← ∅
Compute $b_i$, $C_i$ for all i = 1…ℓ    (moments computation)
Compute $b^* = \sum_i b_i$, $C^* = \sum_i C_i$
Find a solving the QP    (Z-score maximization)
repeat
    for i = 1…ℓ do
        Compute $x_i' = \arg\max_x a^T x$ over the alignments of $S_1^i, S_2^i$    (identify the most violated constraint)
        if $a^T x_i' > a^T \bar{x}_i$ then
            C ← C ∪ { $a^T (\bar{x}_i - x_i') > 0$ }
            Find a solving the QP s.t. C    (constrained Z-score maximization)
        end if
    end for
until C is unchanged during the current iteration
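The step that identifies the most violated constraint can be sketched for the 3-parameter model without a QP solver: run the alignment DP under the current a while carrying the feature vector of the best path, then compare its score with that of the given alignment. The function names are hypothetical, ties are broken in favour of the diagonal move, and the QP re-solve itself is deliberately left out.

```python
def best_features(s1, s2, a):
    """Feature vector (m, s, g) of a highest-scoring alignment under a = (a_m, a_s, a_g)."""
    a_m, a_s, a_g = a
    n, k = len(s1), len(s2)
    # Each cell holds (score, (m, s, g)) for the best alignment of the two prefixes.
    T = [[None] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = (i * a_g, (0, 0, i))
    for j in range(k + 1):
        T[0][j] = (j * a_g, (0, 0, j))
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            sc, (m, s, g) = T[i - 1][j - 1]
            if s1[i - 1] == s2[j - 1]:
                diag = (sc + a_m, (m + 1, s, g))
            else:
                diag = (sc + a_s, (m, s + 1, g))
            sc, (m, s, g) = T[i - 1][j]
            up = (sc + a_g, (m, s, g + 1))
            sc, (m, s, g) = T[i][j - 1]
            left = (sc + a_g, (m, s, g + 1))
            T[i][j] = max(diag, up, left, key=lambda cell: cell[0])
    return T[n][k][1]


def violated_constraint(s1, s2, x_given, a):
    """Return x_given - x_best if the given alignment is beaten under a, else None.

    The returned vector d encodes the linear constraint a^T d > 0.
    """
    x_best = best_features(s1, s2, a)
    score = lambda x: sum(ai * xi for ai, xi in zip(a, x))
    if score(x_best) > score(x_given):
        return tuple(g - b for g, b in zip(x_given, x_best))
    return None


# Hypothetical parameters; x_given is the (m, s, g) vector of the example alignment.
print(violated_constraint("ATGCTTTC", "CTGTCGCC", (4, 1, 6), (1.0, -1.0, -2.0)))
```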

Experimental results
Test error as a function of the training set size.
Distribution of correctly reconstructed alignments as a function of the number of additional constraints.

Experimental results
Experiments with no constraints.
Test error as a function of the training set size.
Given and computed substitution matrices.
Training set size   5      10     20     50     100
Z-score             78.6   62.85  44.6   36.7   30.84
Generative          96.4   94.39  87.12  45.31  31.05

Experimental results
Real sequences of amino acids: 5 multiple alignments from the PALI database of structural protein alignments.
Error rates and added constraints (in parentheses).
           4-parameter model              212-parameter model
Dataset    Training error   Test error    Training error   Test error
nad        4.95 (5)         6.46          567.46 (21)      703.12
kun        1.46 (12)        0.95          386.46 (21)      457.3
box        1 (3)            1.13          211.3 (12)       256.7
sir        1 (10)           1.16          236 (36)         301.44
pec        46.2 (8)         76.1          835.12 (31)      1054.12

Summary
New method for IPSAP:
Accurate and fast (few constraints are required).
Easy to implement: DP for computing the moments and a simple convex optimization problem.
Mean and variance computations are parallelizable for large training sets.
Further work:
Approximate estimation of the moments with sampling techniques.
Possible extension to other problems: sequence labeling learning and sequence parse learning with context-free grammars.