Multiple sequence alignment


Multiple sequence alignment. Proteiinianalyysi 3 (Protein Analysis 3). http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi

Multiple sequence alignment (msa)
Uses:
A) define characters for phylogenetic analysis
B) search for additional family members
Example alignment:
SeqA  N ∙ F L S
SeqB  N ∙ F – S
SeqC  N K Y L S
SeqD  N ∙ Y L S
(Figure: the aligned sequences NFLS, NFS, NKYLS and NYLS placed on a tree whose branches are labelled with the changes +K, -L and Y→F.)

Complexity of performing msa
Sum of pairs (SP) score
Extension of pairwise dynamic programming to N sequences costs O(L^N), where L is the sequence length and N is the number of sequences.
The MSA algorithm searches a subspace bounded by:
- a heuristic multiple alignment (suboptimal)
- the pairwise optimal alignments
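
As an illustration of sum-of-pairs scoring, here is a minimal Python sketch; the match/mismatch/gap values are assumed example numbers, not ones fixed by the slides.

```python
# Minimal sketch: sum-of-pairs (SP) scoring of a multiple alignment.
# The scoring scheme (match/mismatch/gap values) is an illustrative assumption.
from itertools import combinations

MATCH, MISMATCH, GAP = 5, -4, -7  # assumed example values

def sp_column_score(column):
    """Sum of pairwise scores over all sequence pairs in one alignment column."""
    score = 0
    for a, b in combinations(column, 2):
        if a == '-' and b == '-':
            continue                    # gap-gap pairs are usually ignored
        elif a == '-' or b == '-':
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

def sp_score(alignment):
    """SP score of a whole alignment given as a list of equal-length strings."""
    return sum(sp_column_score(col) for col in zip(*alignment))

print(sp_score(["N-FLS", "N-F-S", "NKYLS", "N-YLS"]))
```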

Pairwise projection
(Figure: alignment paths projected onto the Sequence A vs. Sequence B plane.)
- the pairwise optimal alignment and the greedy multiple alignment trace different paths
- the global msa optimum for sequences A, B, C, … is probably found in the area in between

Dynamic programming
(Figure: a small weighted graph from BEGIN to END with edge weights such as 1, 2, 3 and 4.)
What is the maximal path sum from BEGIN to END?
- brute force: enumerate every path
- use induction: there is only one optimal path up to any node in the graph

Example: all paths leading to a node
(Figure: the best path sums accumulated at intermediate nodes; all paths entering a node are compared and only the best is kept.)
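
To make the induction concrete, here is a minimal Python sketch of the maximal-path-sum computation on a small directed acyclic graph; the graph and its edge weights are illustrative assumptions rather than the exact figure from the slide.

```python
# Minimal sketch: maximal path sum from BEGIN to END in a DAG by induction
# (dynamic programming). The graph and edge weights are illustrative assumptions.
from functools import lru_cache

# edges[node] = list of (next_node, edge_weight)
edges = {
    "BEGIN": [("B", 3), ("C", 1)],
    "B":     [("D", 4), ("END", 8)],
    "C":     [("D", 2)],
    "D":     [("END", 7)],
    "END":   [],
}

@lru_cache(maxsize=None)
def best(node):
    """Best achievable sum from `node` to END; only one optimal value per node."""
    if node == "END":
        return 0
    return max(w + best(nxt) for nxt, w in edges[node])

print(best("BEGIN"))  # maximal path sum, computed without enumerating all paths
```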

Heuristic alignment algorithms (motivation: existing alignment methods)

Progressive alignment
1. derive a guide tree from pairwise sequence distances
2. align the closest (groups of) sequences
Pluses: works with many sequences
Minuses: errors made early on are frozen in the final alignment
Programs: Clustal, Pileup, T-Coffee
(A minimal code sketch of the merge order appears after the worked example below.)

Progressive pairwise alignment
(Figure: an evolutionary tree relating TACAGTA, TACGTC, TATA and CACCTA; evolution runs forward in time, while the alignment is built in the reverse order, from the most recently diverged sequences upward.)

Progressive pairwise alignment
(Figure: dynamic-programming matrices for the pairwise steps, e.g. TACAGTA vs TACGTC; diagonal moves are matches, vertical and horizontal moves are gaps. The numbered steps 1-3 show the order in which pairs and groups are aligned.)

Progressive pairwise alignment
(Figure: the intermediate alignments produced at each step and the final four-sequence alignment of TACAGTA, TACGTC, TATA and CACCTA.)
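
The following minimal sketch shows only the guide-tree (merge-order) part of progressive alignment, using the example sequences above and a toy p-distance; the actual pairwise/profile alignment step is left out, and with this crude distance the merge order need not match the tree shown in the slides.

```python
# Minimal sketch: building a progressive-alignment merge order (guide tree)
# from pairwise distances. The distance measure is a deliberately crude
# assumption (mismatch fraction over ungapped prefixes).

def p_distance(a, b):
    """Toy distance: fraction of mismatches over the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[:n], b[:n])) / n

def guide_tree_order(seqs):
    """Greedy agglomeration: repeatedly join the two closest clusters."""
    clusters = [[name] for name in seqs]
    order = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(p_distance(seqs[a], seqs[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        order.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return order  # closest groups would be aligned first, then merged upward

seqs = {"s1": "TACAGTA", "s2": "TACGTC", "s3": "TATA", "s4": "CACCTA"}
for left, right in guide_tree_order(seqs):
    print("align", left, "with", right)
```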

Profile HMMs
(Figure: a profile HMM, with the example sequence CAPNTAGNHPCNGFGTCVPGHGDGSAANNVF aligned to the model.)

Iterative methods
Gotoh:
- inner loop uses progressive alignment
- outer loop recomputes distances and the guide tree based on the msa obtained
DIALIGN:
- identify high-scoring ungapped segment pairs
- build up the msa from consistent segment pairs
- greedy algorithm in DIALIGN2; exact in DIALIGN3

Motifs
- several proteins are grouped together by similarity searches
- they share a conserved motif
- the motif is stringent enough to retrieve the family members from the complete protein database

Local msa
- Profile analysis: a highly conserved portion of the alignment is used to produce a type of scoring matrix called a profile. New sequences can be aligned to the profile using dynamic programming.
- Block analysis: a global msa is scanned for ungapped regions, called blocks, and these blocks are then used in sequence alignments.
- Pattern-searching or statistical methods: find recurrent regions of sequence similarity in a set of initially unaligned sequences.

TEIRESIAS
- finds all patterns with minimum support in an input set of unaligned sequences
- maximum spacer length, for example L…G…………..A….L…L
- enumerative algorithm: build up longer patterns by merging overlapping short patterns, for example A….L + L…L → A….L…L
- fewer instances occur by chance when more positions are specified
- Pluses: exact
- Minuses: biological significance of the patterns?
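
As a small illustration of what a spacer pattern means in practice, the sketch below converts a TEIRESIAS-style pattern into a regular expression and counts its support; the pattern and the toy sequences are made-up examples, and this is only the matching step, not the enumeration-and-merging algorithm itself.

```python
# Minimal sketch: a TEIRESIAS-style pattern is a string of fixed residues
# separated by wildcard spacers. Here such a pattern is turned into a regular
# expression and its support (number of sequences containing it) is counted.
# Pattern and sequences are illustrative assumptions.
import re

def pattern_to_regex(pattern):
    """'A....L...L' -> regex where each '.' matches any single residue."""
    return re.compile(pattern.replace(".", "[A-Z]"))

def support(pattern, sequences):
    rx = pattern_to_regex(pattern)
    return sum(1 for s in sequences if rx.search(s))

seqs = ["MKAVREWLHQPLDVAK", "GGAQWPSLRRELTTYK", "AFDNCLWWQLPPELKH"]
print(support("A....L...L", seqs))
```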

Expectation maximization (EM)
- Start from an initial guess of the motif location in each sequence.
- E-step: estimate the frequencies of amino acids in each motif column; columns not in the motif provide the background frequencies. For each possible site location, calculate the probability that the site starts there.
- M-step: use the site probabilities as weights to produce a new table of expected base counts for each of the site positions.
- Repeat the E-step and M-step until convergence.

EM procedure

Let the observed variables be known as y and the latent variables as z. Together, y and z form the complete data. Assume that p is a joint model of the complete data with parameters θ: p(y, z | θ). An EM algorithm iteratively improves an initial estimate θ_0, constructing new estimates θ_1 through θ_N. An individual re-estimation step takes the following form (shown for the discrete case; the continuous case is similar):

θ_{n+1} = argmax_θ Σ_z p(z | y, θ_n) log p(y, z | θ)    (Eqn 1)

In other words, θ_{n+1} is the value that maximizes (M) the expectation (E) of the complete-data log-likelihood with respect to the conditional distribution of the latent data under the previous parameter value. This expectation is usually denoted Q(θ):

Q(θ) = E_{z | y, θ_n} [ log p(y, z | θ) ]    (Eqn 2)

Speaking of an expectation (E) step is a bit of a misnomer: what is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, Q is fully determined and is maximized in the second (M) step of an EM algorithm.

It can be shown that an EM iteration does not decrease the observed-data likelihood function, and that the only stationary points of the iteration are the stationary points of the observed-data likelihood function. In practice, this means that an EM algorithm converges to a local maximum of the observed-data likelihood function.

EM is particularly useful when maximum likelihood estimation of the complete-data model is easy. If closed-form estimators exist, the M-step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.

"Expectation-maximization" describes a class of related algorithms rather than a specific algorithm; EM is a recipe, or meta-algorithm, used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models; another is the EM algorithm for fitting a mixture density model. An EM algorithm can also find maximum a posteriori (MAP) estimates by performing MAP estimation in the M-step rather than maximum likelihood.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variants of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
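
To ground the abstract description, here is a minimal EM sketch for the mixture-of-Gaussians example mentioned above (two components with unit variances); the data and starting values are illustrative assumptions.

```python
# Minimal sketch: EM for a two-component 1D Gaussian mixture with unit
# variances. The E-step computes posterior responsibilities (the weights that
# define Q), the M-step re-estimates the means and mixing proportion in
# closed form. Data and starting values are illustrative assumptions.
import math

def em_two_gaussians(y, mu=(-1.0, 1.0), pi=0.5, iters=50):
    for _ in range(iters):
        # E-step: responsibility of component 1 for each observation
        r = []
        for x in y:
            p1 = pi * math.exp(-0.5 * (x - mu[0]) ** 2)
            p2 = (1 - pi) * math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: weighted means and mixing proportion maximize Q(theta)
        n1 = sum(r)
        mu = (sum(ri * x for ri, x in zip(r, y)) / n1,
              sum((1 - ri) * x for ri, x in zip(r, y)) / (len(y) - n1))
        pi = n1 / len(y)
    return mu, pi

data = [-2.1, -1.7, -2.4, 1.8, 2.2, 1.9, 2.5, -1.9]
print(em_two_gaussians(data))
```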

Exercise
Analyze the following ten DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle three positions are a motif. The size of the motif is a guess based on a molecular model. The alignment of the sequences is also a guess.
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A

(a) Calculate the observed frequency of each base at each of the three middle positions.

Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

(b) Calculate the odds likelihood of finding the motif at each of the possible locations in sequence 5 (CAGAT), using the frequency table above.

Window CAG (positions 1-3): 0.1*0.6*0.2 / (0.25*0.25*0.25) = 0.768
Window AGA (positions 2-4): 0.1*0.1*0.2 / (0.25*0.25*0.25) = 0.128
Window GAT (positions 3-5): 0.1*0.6*0.4 / (0.25*0.25*0.25) = 1.536
Non-motif sites: 0.25/0.25 = 1

(c) Calculate the probability of finding the motif at each position of sequence 5 (CAGAT). Non-motif positions contribute the background probability 0.25.

Window CAG: 0.1*0.6*0.2*0.25*0.25 = 0.00075
Window AGA: 0.25*0.1*0.1*0.2*0.25 = 0.000125
Window GAT: 0.25*0.25*0.1*0.6*0.4 = 0.0015

(d) Calculate what change will be made to the base count in each column of the motif table as a result of matching the motif to sequence 5.

Relative weights (each probability divided by their sum, 0.002375):
Window CAG: 0.00075  -> weight 0.32
Window AGA: 0.000125 -> weight 0.05
Window GAT: 0.0015   -> weight 0.63

Updated counts contributed by sequence 5 (the initial location guess was the middle window):
Base   1st    2nd              3rd
A      0.05   0.32+0.63=0.95   0.05
C      0.32   0                0
G      0.63   0.05             0.32
T      0      0                0.63

(e) What other steps are taken to update or maximize the table values?
- The weighted data from the remaining sequences are added to the counts table in the same way.
- The base frequencies in the new table are used as an updated estimate of the site residue composition.
- The expectation and maximization steps are repeated until the base frequencies no longer change.
A code sketch reproducing steps (a)-(d) follows.
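
This is a short Python sketch reproducing steps (a)-(d) for sequence 5; it follows the slide's assumptions (background 0.25, motif width 3, initial guess at the middle positions) and omits pseudocounts and the full update loop.

```python
# Minimal sketch reproducing steps (a)-(d) of the EM exercise: column
# frequencies from the initial motif guess, window probabilities for
# sequence 5, and the relative weights used to update the count table.
seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
W, BG = 3, 0.25                                 # motif width, background

# (a) observed base frequencies in the middle three positions
freq = [{b: 0.0 for b in "ACGT"} for _ in range(W)]
for s in seqs:
    for k, base in enumerate(s[1:1 + W]):       # initial guess: positions 2-4
        freq[k][base] += 1 / len(seqs)

# (b, c) probability of the motif starting at each position of sequence 5
s5 = "CAGAT"
probs = []
for start in range(len(s5) - W + 1):
    p = 1.0
    for pos, base in enumerate(s5):
        in_motif = start <= pos < start + W
        p *= freq[pos - start][base] if in_motif else BG
    probs.append(p)
print([round(p, 6) for p in probs])             # 0.00075, 0.000125, 0.0015

# (d) relative weights used to update the count table
total = sum(probs)
print([round(p / total, 2) for p in probs])     # 0.32, 0.05, 0.63
```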

Gibbs sampler
- Start from a random msa; realign one sequence against the profile derived from the other sequences; iterate.
- Finds the most probable pattern common to all of the sequences by sliding them back and forth until the ratio of the motif probability to the background probability is a maximum.

A. Estimate the amino acid frequencies in the motif columns of all but one sequence; also obtain the background frequencies.
(Figure: a random start; one motif position M is chosen in each sequence, e.g. xxxMxxxxx, xxxxxxMxx, xxxxxMxxx, xMxxxxxxx, Mxxxxxxxx, xxxxMxxxx, xxxxxxxxM.)

B. Use the estimates from A to calculate the ratio of the motif probability to the background probability at each position in the left-out sequence. This ratio, for each possible location, is the weight of that position.
(Figure: the motif window M is slid along the left-out sequence and a weight is computed at every position.)

C. Choose a new location for the motif in the left-out sequence by a random selection, using the weights to bias the choice.
(Figure: estimated location of the motif in the left-out sequence, e.g. xxxxxxxMxx.)

D. Repeat steps A to C many (>100) times. A minimal code sketch of steps A-D follows.
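
The sketch below implements steps A-D in a simplified form, with small pseudocounts added so that unseen bases do not get zero probability; the pseudocount value and the number of iterations are illustrative choices, not part of the slides.

```python
# Minimal sketch of the Gibbs sampling motif search described in steps A-D:
# one sequence is left out in turn, a frequency profile is built from the
# motif positions in the remaining sequences, and a new position is drawn in
# the left-out sequence with probability proportional to the motif/background
# odds. Pseudocounts and convergence checks are kept deliberately simple.
import random

def gibbs_motif(seqs, w, iters=200, bg=0.25, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]     # random start (A)
    for _ in range(iters):
        for left_out in range(len(seqs)):
            # profile from all sequences except the left-out one (step A)
            counts = [{b: 0.5 for b in "ACGT"} for _ in range(w)]  # pseudocounts
            for i, s in enumerate(seqs):
                if i == left_out:
                    continue
                for k in range(w):
                    counts[k][s[pos[i] + k]] += 1
            total = len(seqs) - 1 + 2.0
            # odds ratio for every window of the left-out sequence (step B)
            s = seqs[left_out]
            weights = []
            for start in range(len(s) - w + 1):
                odds = 1.0
                for k in range(w):
                    odds *= (counts[k][s[start + k]] / total) / bg
                weights.append(odds)
            # sample a new motif position using these weights (step C)
            pos[left_out] = rng.choices(range(len(weights)), weights)[0]
    return pos

seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
print(gibbs_motif(seqs, w=3))
```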

Exercise: Gibbs sampler
Analyze the following DNA sequences by the Gibbs sampling algorithm.
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A

(a) Assuming that the background base frequencies are 0.25, calculate a log odds matrix for the central three positions.

Frequencies:
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

Log odds (log2 of frequency / 0.25):
Base   1st     2nd     3rd
A     -1.32    1.26   -0.32
C     -1.32   -1.32   -0.32
G     -1.32   -1.32   -0.32
T      1.49   -0.32    0.68

(b) Assuming that another sequence, GTTTG, is the left-out sequence, slide the log-odds matrix along it and find the log-odds score at each of the three possible positions.

Window GTT: -1.32 + -0.32 + 0.68 = -0.96
Window TTT:  1.49 + -0.32 + 0.68 = 1.85
Window TTG:  1.49 + -0.32 + -0.32 = 0.85

(c) Calculate the probability of a match at each position in the left-out sequence.

Since the log-odds score is s = log2(p / bg), the probability is recovered as p = 2^s * bg, where bg = 0.25^3 = 1/64. The same values follow directly from the frequency table:

Window GTT: 2^-0.96 / 64 = 0.008   (directly: 0.1*0.2*0.4 = 0.008)
Window TTT: 2^1.85 / 64  = 0.056   (directly: 0.7*0.2*0.4 = 0.056)
Window TTG: 2^0.85 / 64  = 0.028   (directly: 0.7*0.2*0.2 = 0.028)

(d) How do we choose a possible location for the motif in the left-out sequence?

The probabilities are normalised so that they sum to one: 0.008 + 0.056 + 0.028 = 0.092.
Window GTT: 0.008/0.092 = 0.09
Window TTT: 0.056/0.092 = 0.61
Window TTG: 0.028/0.092 = 0.30
A new location is then drawn at random using these normalised weights. A code sketch reproducing these numbers follows.
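
The same numbers can be reproduced with a few lines of Python; rounding is done only at the end, so the scores differ slightly from the hand-rounded sums above.

```python
# Minimal sketch reproducing the worked example: the log-odds matrix is built
# from the column frequencies, slid along the left-out sequence GTTTG, and the
# window scores are converted back to probabilities and normalized weights.
import math

freq = [  # columns 1-3 of the motif model; background is 0.25 for every base
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.6, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
]
BG = 0.25
log_odds = [{b: math.log2(col[b] / BG) for b in col} for col in freq]

seq, w = "GTTTG", 3
scores = [sum(log_odds[k][seq[start + k]] for k in range(w))
          for start in range(len(seq) - w + 1)]
print([round(s, 2) for s in scores])          # approx. -0.97, 1.84, 0.84

probs = [2 ** s * BG ** w for s in scores]    # back to plain probabilities
print([round(p, 3) for p in probs])           # approx. 0.008, 0.056, 0.028

weights = [p / sum(probs) for p in probs]
print([round(x, 2) for x in weights])         # approx. 0.09, 0.61, 0.30
```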

Modelling protein families

PSSM
- The PSSM is constructed by a logarithmic transformation of a matrix giving the frequency of each amino acid in the motif.
- If a good sampling of sequences is available, the number of sequences is sufficiently large, and the motif structure is not too complex, it should in principle be possible to produce a PSSM that is highly representative of the same motif in other sequences as well.
- Pseudocounts
- Sequence logo, information content
(A code sketch covering pseudocounts and information content follows.)
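
Here is a minimal sketch of PSSM construction with pseudocounts, together with the per-column information content displayed by a sequence logo; the motif instances and the pseudocount value are illustrative assumptions.

```python
# Minimal sketch: building a PSSM (log2-odds matrix) from aligned motif
# instances with pseudocounts, plus per-column information content in bits.
# Motif instances and pseudocount value are illustrative assumptions.
import math

def pssm(instances, alphabet="ACGT", pseudocount=1.0, bg=0.25):
    """Log2-odds matrix as a list of dicts, one per motif column."""
    w = len(instances[0])
    matrix = []
    for k in range(w):
        counts = {b: pseudocount for b in alphabet}
        for inst in instances:
            counts[inst[k]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2((counts[b] / total) / bg) for b in alphabet})
    return matrix

def information_content(matrix, bg=0.25):
    """Per-column information (bits), as shown by the heights in a sequence logo."""
    ic = []
    for col in matrix:
        probs = {b: bg * 2 ** s for b, s in col.items()}   # back to frequencies
        ic.append(sum(p * math.log2(p / bg) for p in probs.values() if p > 0))
    return ic

motif_instances = ["ACGT", "ACGT", "AGGT", "TCGA", "ACGT"]
m = pssm(motif_instances)
print([round(x, 2) for x in information_content(m)])
```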

Exercise: construct a PSSM

Base   Column 1 frequency   Column 2 frequency   Column 3 frequency   Column 4 frequency
A      0.6                  0.1                  0.2                  0.1
C      0.1                  0.7                  0.1                  0.1
G      0.1                  0.1                  0.6                  0.1
T      0.2                  0.1                  0.1                  0.7

Exercise cntd (a) Assuming the background frequency is 0.25 for each base, calculate a log odds score for each table position, i.e. log to the base 2 of the ratio of each observed value to the expected frequency.

Observed/expected frequencies

Base   Column 1        Column 2        Column 3        Column 4
A      0.6/0.25=2.4    0.1/0.25=0.4    0.2/0.25=0.8    0.1/0.25=0.4
C      0.1/0.25=0.4    0.7/0.25=2.8    0.1/0.25=0.4    0.1/0.25=0.4
G      0.1/0.25=0.4    0.1/0.25=0.4    0.6/0.25=2.4    0.1/0.25=0.4
T      0.2/0.25=0.8    0.1/0.25=0.4    0.1/0.25=0.4    0.7/0.25=2.8

Log-odds scores

Base   Column 1   Column 2   Column 3   Column 4
A       1.26      -1.32      -0.32      -1.32
C      -1.32       1.49      -1.32      -1.32
G      -1.32      -1.32       1.26      -1.32
T      -0.32      -1.32      -1.32       1.49

Exercise cntd (b) Align the matrix with each position in the sequence TCACGTAA starting at position 1,2, etc., and calculate the log odds score for the matrix to that position.

Using the log-odds table:

TCACGTAA, window TCAC (position 1): -0.32 + 1.49 + -0.32 + -1.32 = -0.47
TCACGTAA, window CACG (position 2): -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TCACGTAA, window ACGT (position 3):  1.26 + 1.49 + 1.26 + 1.49 = 5.50
TCACGTAA, window CGTA (position 4): -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TCACGTAA, window GTAA (position 5): -1.32 + -1.32 + -0.32 + -1.32 = -4.28

(c) Calculate the probability of the best matching position.
The best match is the window ACGT starting at position 3: p(ACGT) = 0.6*0.7*0.6*0.7 = 0.1764. A short code sketch reproducing parts (b) and (c) follows.
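
This sketch reproduces parts (b) and (c) with the frequency table above; rounding is left to the end, so the window scores differ slightly from the hand-rounded sums.

```python
# Minimal sketch reproducing parts (b) and (c): the log-odds matrix is slid
# along TCACGTAA and the plain probability of the best-scoring window is
# computed from the frequency table given in the exercise.
import math

freq = [
    {"A": 0.6, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BG = 0.25
log_odds = [{b: math.log2(col[b] / BG) for b in col} for col in freq]

seq, w = "TCACGTAA", 4
scores = [sum(log_odds[k][seq[i + k]] for k in range(w))
          for i in range(len(seq) - w + 1)]
print([round(s, 2) for s in scores])   # the window ACGT at position 3 scores ~5.5

best = scores.index(max(scores))
p = math.prod(freq[k][seq[best + k]] for k in range(w))
print(round(p, 4))                     # 0.6*0.7*0.6*0.7 = 0.1764
```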

Alignment of sequences with a structure: hidden Markov models
- HMMs are well suited to describing correlation among neighbouring sites
- the probabilistic framework allows realistic modelling
- well-developed mathematical methods provide:
  - the best solution (most probable path through the model)
  - a confidence score (posterior probability of any single solution)
  - inference of structure (posterior probability of using model states)
- we can align multiple sequences using complex models and simultaneously predict their internal structure
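
As a minimal illustration of "the most probable path through the model", here is a Viterbi sketch for a toy two-state HMM; the states and all transition and emission probabilities are illustrative assumptions, not a profile HMM for alignment.

```python
# Minimal sketch: the Viterbi algorithm finds the most probable state path
# through a small HMM. The two-state model ('conserved' vs 'background'
# columns) and its parameters are illustrative assumptions.
import math

states = ["conserved", "background"]
start = {"conserved": 0.5, "background": 0.5}
trans = {"conserved": {"conserved": 0.9, "background": 0.1},
         "background": {"conserved": 0.1, "background": 0.9}}
emit = {"conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    # v[state] = best log-probability of any path ending in `state`
    v = {s: math.log(start[s] * emit[s][seq[0]]) for s in states}
    back = []
    for x in seq[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            v[s] = prev[best_prev] + math.log(trans[best_prev][s] * emit[s][x])
            ptr[s] = best_prev
        back.append(ptr)
    # trace back the most probable state path
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("AAATCGTAAA"))
```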