Approaches to Sequence Analysis

Data: {GTCAT, GTTGGT, GTCA, CTCA}
Alignment (parsimony, similarity, optimisation):
GT-CAT
GTTGGT
GT-CA-
CT-CA-
Tree relating s1, s2, s3 and s4 (statistics).
Actual practice: 2-phase analysis. Ideal practice: 1-phase analysis.
1. TKF91: the combined substitution/indel process
2. Acceleration of the basic algorithm
3. Many-sequence algorithm
4. MCMC approaches

Number of alignments, T(n,m)
If $T(n,m)$ is the number of alignments of s1[1,n] and s2[1,m], then
$T(n,m) = T(n-1,m) + T(n,m-1) + T(n-1,m-1)$, with $T(0,0) = 1$, and $T(n,m) \ge 3^{\min(n,m)}$.
Alignment columns are equivalent to the steps (0,1), (1,0) and (1,1) in a [0,n] x [0,m] matrix. Thus an alignment-by-alignment search for the best alignment is not realistic.
If the column pair (n,-)(-,n) is considered equivalent to (-,n)(n,-), then alignments are equivalent to choosing two equal-sized subsets of s1 and s2 that have to be matched, so the count drops to $\sum_k \binom{n}{k}\binom{m}{k} = \binom{n+m}{n}$.
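The growth of T(n,m) can be checked numerically. The sketch below fills the table bottom-up; the boundary values T(n,0) = T(0,m) = 1 are an assumption (an empty sequence can only be aligned in one way), and the function name is illustrative.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments T(n, m) of two sequences of lengths n and m."""
    # Assumed boundary: exactly one alignment when either sequence is empty.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Last column is (n, -), (-, n) or (n, n).
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]


if __name__ == "__main__":
    # Lengths of the example pair CTAGG and TTGT used on the next slide.
    print(count_alignments(5, 4))
```

Even for sequences of length 5 and 4 there are hundreds of alignments, which is why the following slides switch to dynamic programming.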

Parsimony Alignment of two strings
Sequences: s1 = CTAGG, s2 = TTGT.
Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10.
Cost additivity:
CTAG   CTA   G
     =     +
TT-G   TT-   G
Recursion, with $D_{i,j} := D(s1[1:i], s2[1:j])$:
$D_{i,j} = \min\{ D_{i-1,j-1} + d(s1[i], s2[j]),\; D_{i,j-1} + g,\; D_{i-1,j} + g \}$
For {CTAG,TTG}_AL the minimum is taken over:
(A) {CTA,TT}_AL + G/G
(B) {CTA,TTG}_AL + G/-
(C) {CTAG,TT}_AL + -/G
Initial condition: $D_{0,0} = 0$.

Alignment (i = transition, v = transversion):
i   v
CTAGG
TT-GT
Cost: 17
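The recursion and cost scheme above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names are hypothetical, and the boundary rows (i or j equal to zero cost one indel per position) are an assumption implied by the recursion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(a: str, b: str) -> int:
    """0 for identity, 2 for a transition, 5 for a transversion."""
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 2 if same_class else 5


def parsimony_align(s1: str, s2: str, gap: int = 10):
    """Return (cost, aligned s1, aligned s2) for a minimum-cost alignment."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + substitution_cost(s1[i - 1], s2[j - 1]),
                D[i][j - 1] + gap,
                D[i - 1][j] + gap,
            )
    # Trace back one optimal path from (n, m) to (0, 0).
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + substitution_cost(s1[i - 1], s2[j - 1]):
            row1.append(s1[i - 1]); row2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            row1.append(s1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1]); j -= 1
    return D[n][m], "".join(reversed(row1)), "".join(reversed(row2))


if __name__ == "__main__":
    print(parsimony_align("CTAGG", "TTGT"))
```

For the slide's example the traceback recovers the alignment CTAGG over TT-GT with cost 17 (one transition, one indel, one transversion).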

Alignment of three sequences
s1 = ATCG, s2 = ATGCC, s3 = CTCC
Alignment:
AT-CG
ATGCC
CT-CC
Consensus sequence: ATCC
Configurations in an alignment column: the $2^3 - 1 = 7$ non-empty combinations of nucleotide (n) and gap (-) over the three rows.
Initial condition: $D_{0,0,0} = 0$.
Recursion: $D_{i,j,k} = \min\{ D_{i-i',\,j-j',\,k-k'} + d(i,i',j,j',k,k') \}$, with $(i',j',k') \in \{0,1\}^3 \setminus \{(0,0,0)\}$.
Running time: $l_1 l_2 l_3 (2^3 - 1)$.
Memory requirement: $l_1 l_2 l_3$.
New phenomenon: ancestral/consensus sequence.

Parsimony Alignment of four sequences
s1 = ATCG, s2 = ATGCC, s3 = CTCC, s4 = ACGCG
Alignment:
AT-CG
ATGCC
CT-CC
ACGCG
Configurations in alignment columns: the $2^4 - 1 = 15$ non-empty combinations of nucleotide (n) and gap (-) over the four rows.
Initial condition: $D_0 = 0$.
Recursion: $D_i = \min_{\Delta}\{ D_{i-\Delta} + d(i,\Delta) \}$, $\Delta \in \{0,1\}^4 \setminus \{0\}^4$.
Computation time: $l_1 l_2 l_3 l_4 (2^4 - 1)$.
Memory: $l_1 l_2 l_3 l_4$.
New phenomenon: cost and alignment are phylogeny dependent.
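The set of column configurations that the recursion minimises over can be enumerated directly. The sketch below is illustrative only (function names are hypothetical); it counts table cells as the product of (length + 1), which is an assumption about how the boundary is stored.

```python
from itertools import product


def column_configurations(k: int):
    """All vectors in {0,1}^k except the all-zero vector.

    A 1 means the sequence contributes a nucleotide to the column,
    a 0 means it contributes a gap.
    """
    return [delta for delta in product((0, 1), repeat=k) if any(delta)]


def dp_work(lengths):
    """Return (table cells, cell updates) for the k-dimensional recursion."""
    cells = 1
    for length in lengths:
        cells *= length + 1  # assumed: indices 0..length in each dimension
    return cells, cells * (2 ** len(lengths) - 1)


if __name__ == "__main__":
    print(len(column_configurations(3)), len(column_configurations(4)))
    print(dp_work([4, 5, 4, 5]))  # lengths of s1..s4 on the slide
```

The number of configurations doubles with every added sequence while the table grows by a factor of the sequence length, which is why exact multiple alignment is limited to very few sequences.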

Progressive Alignment (Feng & Doolittle 1987, J. Mol. Evol.)
Can align alignments and, given a tree, make a multiple alignment.
Example: aligning the alignment
alkmny-trwq
akkmdyftrwq
kkkmemftrwq
with the alignment
acdeqrt
acdehrt
Matching the starred columns (n, d, e) and (q, h) is scored by the average over all pairs:
[P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6
Tree leaves: Sodh, Sodb, Sodl, Sddm, Sdmz, Sods, Sdpb.
Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte---glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sddm atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte---glhgfhvhqfg----ndtagct sagphfnp lsrk
Sdmz atkavcvlkgdgpqvq-- infeqkesdgpvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sods vatkavcvlkgdgpqvq-- infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk
Sdpb datkavcvlkgdgpqvq---infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk
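The averaged column score used when aligning two alignments can be sketched as below. The pair score P is not specified on the slide, so `toy_score` is a purely hypothetical stand-in (1 for identical residues, 0 otherwise); gaps are not handled.

```python
from itertools import product


def column_score(col_a, col_b, pair_score):
    """Average of pair_score(x, y) over all residue pairs from two columns."""
    pairs = list(product(col_a, col_b))
    return sum(pair_score(x, y) for x, y in pairs) / len(pairs)


def toy_score(x: str, y: str) -> float:
    """Hypothetical pair score: identity scores 1, anything else 0."""
    return 1.0 if x == y else 0.0


if __name__ == "__main__":
    # Starred columns of the example: (n, d, e) against (q, h).
    print(column_score("nde", "qh", toy_score))
```

With a real substitution matrix in place of `toy_score`, this reproduces the six-term average shown on the slide.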

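Thorne-Kishino-Felsenstein (1991) Process
$\lambda$ (birth rate), $\mu$ (death rate)
[Figure: a sequence of links (#) at T = 0 and its descendants at T = t; sequences s1 and s2 related through a root r.]
1. $P(s) = (1-\lambda/\mu)(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l = \mathrm{length}(s)$.
2. Time reversible.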
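As a small illustration of the equilibrium distribution, the sketch below evaluates P(s) as a geometric length distribution with parameter λ/μ times independent residue frequencies. The function name and the uniform frequencies in the usage line are assumptions made for the example.

```python
def tkf91_equilibrium(seq: str, lam: float, mu: float, pi: dict) -> float:
    """Equilibrium probability of a sequence under TKF91 (requires lam < mu)."""
    if not 0.0 < lam < mu:
        raise ValueError("TKF91 needs 0 < birth rate < death rate")
    ratio = lam / mu
    prob = (1.0 - ratio) * ratio ** len(seq)  # geometric length distribution
    for residue in seq:
        prob *= pi[residue]  # independent residues with frequencies pi
    return prob


if __name__ == "__main__":
    uniform = {base: 0.25 for base in "ACGT"}  # assumed frequencies
    print(tkf91_equilibrium("GTCAT", lam=1.0, mu=2.0, pi=uniform))
```

The expected sequence length under this distribution is λ/(μ - λ), so the ratio of the two rates fixes the typical length.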

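$\lambda$ & $\mu$ into Alignment Blocks
A. Amino acids ignored. With $\beta = \dfrac{1 - e^{(\lambda-\mu)t}}{\mu - \lambda e^{(\lambda-\mu)t}}$:
Link survives and the block has k links: $p_k(t) = e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$
Link dies and leaves k links: $p'_k(t) = [1 - e^{-\mu t} - \mu\beta][1-\lambda\beta](\lambda\beta)^{k-1}$, $p'_0(t) = \mu\beta(t)$
Immortal link (*) followed by k links: $p''_k(t) = [1-\lambda\beta](\lambda\beta)^{k}$
B. Amino acids considered.
T survives and the block is R Q S W: $P_t(T \to R)\,\pi_Q \cdots \pi_W\, p_4(t)$
T dies and leaves R Q S W: $\pi_R\,\pi_Q \cdots \pi_W\, p'_4(t)$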
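The block probabilities can be evaluated numerically. The sketch below uses the standard TKF91 closed forms; the expression for p'_k with k ≥ 1 is taken from the published model (an assumption here, since the slide text is damaged at that point), and the function names are illustrative. A useful sanity check is that the survive and die cases together sum to one, as do the immortal-link probabilities.

```python
import math


def beta(lam: float, mu: float, t: float) -> float:
    x = math.exp((lam - mu) * t)
    return (1.0 - x) / (mu - lam * x)


def p_survive(k: int, lam: float, mu: float, t: float) -> float:
    """p_k(t): link survives and its block has k >= 1 links."""
    lb = lam * beta(lam, mu, t)
    return math.exp(-mu * t) * (1.0 - lb) * lb ** (k - 1)


def p_die(k: int, lam: float, mu: float, t: float) -> float:
    """p'_k(t): link dies and leaves k >= 0 links (standard TKF91 form)."""
    b = beta(lam, mu, t)
    if k == 0:
        return mu * b
    lb = lam * b
    return (1.0 - math.exp(-mu * t) - mu * b) * (1.0 - lb) * lb ** (k - 1)


def p_immortal(k: int, lam: float, mu: float, t: float) -> float:
    """p''_k(t): immortal link followed by k >= 0 links."""
    lb = lam * beta(lam, mu, t)
    return (1.0 - lb) * lb ** k


if __name__ == "__main__":
    lam, mu, t = 1.0, 2.0, 1.0  # arbitrary example values with lam < mu
    fate = sum(p_survive(k, lam, mu, t) for k in range(1, 200))
    fate += sum(p_die(k, lam, mu, t) for k in range(0, 200))
    print(fate, sum(p_immortal(k, lam, mu, t) for k in range(200)))
```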

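Basic Pairwise Recursion (O(length^3))
The last link s1[i] either survives or dies:
Survives: its block covers the last k positions of s2, k = 1…j (j cases).
Dies: it leaves the last k positions of s2, k = 0…j (j+1 cases).
Each case is weighted by a block probability such as $e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$, where $\beta = [1-e^{(\lambda-\mu)t}]/[\mu - \lambda e^{(\lambda-\mu)t}]$.
Cell (i,j) therefore depends on (i-1,j) (death), (i-1,j-1) (survive or death), …, (i-1,j-k), ….
Initial condition: $P(s1_0 \to s2_j) = p''_j\,\pi(s2[1:j])$.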

Acceleration of Pairwise Algorithm (from Hein, Wiuf, Knudsen, Møller & Wibling 2000)
Corner cutting ~
Better numerical search ~ (ex.: good start guess, 28 evaluations, 3 iterations)
Simpler recursion ~3-10
Faster computers ~
>2000: ~$10^6$

α-globin (141) and β-globin (146) (from Hein, Wiuf, Knudsen, Møller & Wibling 2000)
-log(α-globin):
-log(α-globin --> β-globin):
-log(α-globin, β-globin) = -log(l(sumalign)):
λ*t: / μ*t: / s*t: /
E(Length), E(Insertions, Deletions), E(Substitutions)
Maximum contributing alignment:
V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS

NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) =

Homology test (from Hein, Wiuf, Knudsen, Møller & Wibling 2000)
1. Test the competing hypotheses that 2 sequences are 2.5 events apart versus infinitely far apart.
2. It only handles substitutions “correctly”. The rationale for indel costs is more arbitrary.
Scores: $W_{i,j} = -\ln\big(\pi_i P^{2.5}_{i,j} / (\pi_i \pi_j)\big)$
D(s1,s2) is evaluated against the distribution of D(s1,s2*), where s2* is a shuffled version of s2.
Real: s1 = ATWYFCAK-AC, s2 = ETWYKCALLAD
Random: s1 = ATWYFC-AKAC, s2* = LTAYKADCWLE
α-globin, β-globin and myoglobin homology tests.

Goodness-of-fit of TKF91
Sample random alignments from real sequences (e.g. cgtgttacatatatatagccgatagccg).
Sample random alignments from random sequences.
Compare the real and random distributions using a chi-square statistic.

Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)
[Figure: star tree with ancestral sequence a (* followed by # links) and leaf sequences s1, s2, s3.]

Statistical Alignment via Hidden Markov Models (Steel and Hein; Holmes and Bruno, 2001)
[Figure: pair HMM with start state *, end state E and the column states #/#, #/- and -/#, with transition probabilities on the arrows.]
Emission probabilities, e.g. $\pi(C) f(C \to C)$, $\pi(C) f(C \to T)$, $\pi(A)$.
HMM formulation allows:
Finding the most probable alignment
Probability of a sequence pair
Probability of a specific edge

Transition Probabilities between two k-ancestral states
[Figure: numbered column states built from # and -, with transitions between them.]

Maximum likelihood phylogeny and alignment
Sequences: human alpha hemoglobin; human beta hemoglobin; human myoglobin; bean leghemoglobin.
Probability of data: e
Probability of data and alignment: e
Probability of alignment given data: e
Ratio of insertion-deletions to substitutions:
Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song

Metropolis-Hastings Statistical Alignment (Lunter, Drummond, Miklos, Jensen & Hein, 2005)
The alignment moves:
1. We choose a random window in the current alignment:
ALITL---GG ALLTLTTLGG ---TLTSLGA ALLGLTSLGA
TNQHVSCTGN GN-HVSCTGK TNQH-SCTLN TNQHVSCTLN
Window: QST--QCC-S / S------CCS / ---QST--QC
2. Then delete all gaps so we get back subsequences: QSTQCCS / SCCS / QSTQC
3. Stochastically realign this part: QSTQCCS / -S--CCS / QSTQC--
The phylogeny moves: as in Drummond et al. 2002.
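The window step and the generic Metropolis-Hastings acceptance rule can be sketched as below. This is a minimal illustration, not the published sampler: the function names are hypothetical, the stochastic realignment itself is not implemented, and the log densities passed to `mh_accept` are assumed to be supplied by the caller.

```python
import math
import random


def choose_window(alignment_length: int, width: int, rng: random.Random):
    """Pick a random column range [start, end) of the given width."""
    start = rng.randrange(0, alignment_length - width + 1)
    return start, start + width


def strip_window(rows, start: int, end: int):
    """Cut the window out of every row and delete its gaps."""
    return [row[start:end].replace("-", "") for row in rows]


def mh_accept(log_target_new, log_target_old, log_q_forward, log_q_backward, rng):
    """Standard Metropolis-Hastings test for a proposed state.

    log_q_forward is the log proposal density old -> new,
    log_q_backward the log proposal density new -> old.
    """
    log_ratio = (log_target_new - log_target_old) + (log_q_backward - log_q_forward)
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


if __name__ == "__main__":
    window_rows = ["QST--QCC-S", "S------CCS", "---QST--QC"]
    print(strip_window(window_rows, 0, 10))
```

Applied to the window rows on the slide, `strip_window` returns the gap-free subsequences that are then realigned.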


Many Sequences: Sequence Graphs (reticular alignment)
Istvan Miklos, Gerton Lunter, Miklos Csuros
Investigate a set of ancestral sequences/alignments that is computationally realistic.
1. A set of homologous sequences is given, with a known phylogeny.
2. Pairs of sequences are aligned.
3. Graphs are defined representing alignments/ancestral sequences.
4. Pairs of graphs are aligned, and so on.

Extensions
TKF92: like TKF91, except that single nucleotides are replaced by geometric-length fragments ("flakes") of nucleotides. A fragment does not experience indels.
Local Statistical Alignment: homologous segments are now embedded within unrelated sequences. Both regions can be well modelled.
Long Indel Model: insertions now have to be given a length distribution. Deletions are associated with intervals on the sequences. An $l^4$ algorithm is available.

Summary and Future Work
A statistical approach to alignment
A stochastic model including insertion-deletions
The fate of a single nucleotide
Dynamic programming solution to the pairwise problem
An HMM solution to pairwise statistical alignment
Multiple statistical alignment
Problems ahead (enough to do):
Longer insertion-deletions
Heterogeneity of positions
Testing models
Combining with annotation
Very large number of sequences

References: Statistical Alignment
Fleissner R, Metzler D, von Haeseler A. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst Biol, Aug; 54(4).
Hein J, Wiuf C, Knudsen B, Møller M, Wibling G (2000). Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit. J. Molecular Biology.
Hein JJ (2001). A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a binary tree. Pac. Symp. Biocomput. (eds RB Altman et al.).
Steel M, Hein JJ (2001). A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a star tree. Letters in Applied Mathematics.
Hein JJ, Jensen JL, Pedersen C. Algorithms for Multiple Statistical Alignment. PNAS, 2003 Dec 9; 100(25).
Holmes I (2003). Using Guide Trees to Construct Multiple-Sequence Evolutionary HMMs. Bioinformatics, special issue for ISMB2003, 19:147i–157i.
Jensen JL, Hein J (2004). A Gibbs sampler for statistical multiple alignment. Statistica Sinica, in press.
Miklós I, Lunter GA, Holmes I (2004). A 'long indel' model for evolutionary sequence alignment. Mol. Biol. Evol. 21(3):529–540.
Lunter GA, Miklós I, Drummond AJ, Jensen JL, Hein J (2005). Bayesian Coestimation of Phylogeny and Sequence Alignment. BMC Bioinformatics, 6:83.
Lunter GA, Miklós I, Drummond A, Jensen JL, Hein J (2003). Bayesian phylogenetic inference under a statistical indel model. Lecture Notes in Bioinformatics, Proceedings of WABI'03, 2812:228–244.
Lunter GA, Miklós I, Song YS, Hein J (2003). An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10(6):869–88.
Miklos, Lunter & Holmes (2002). Submitted to ISMB.
Miklos I, Toroczkai Z (2001). An improved model for statistical alignment. In WABI2001, Lecture Notes in Computer Science (O. Gascuel & BME Moret, eds), 2149. Springer, Berlin.
Metzler D. Statistical alignment based on fragment insertion and deletion models. Bioinformatics, Mar 1; 19(4).
Miklos I (2002). An improved algorithm for statistical alignment of sequences related by a star tree. Bul. Math. Biol. 64.
Miklos I. Algorithm for statistical alignment of sequences derived from a Poisson sequence length distribution. Disc. Appl. Math., accepted.
Thorne JL, Kishino H, Felsenstein J. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol, Jan; 34(1):3-16.
Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol, Aug; 33(2). Erratum in: J Mol Evol 1992 Jan; 34(1):91.
Thorne JL, Churchill GA. Estimation and reliability of molecular sequence alignments. Biometrics, Mar; 51(1).
TKF92, Long Indel, Explain HMM, Multiple Recursion, Hidden State Space, 1-state recursion and other reductions, competing algorithms

The invasion of the immortal link
VLSPADNAL.....DLHAHKR (141 AA long)
???????????????????? (k AA long)
[Figure: the immortal link (*) followed by # links, * ########### …. ### (141 AA long), shown at successive times up to $10^9$ years.]

Binary Tree Problem
Leaf sequences s1, s2, s3, s4 (e.g. ACCT, GTT, TGA, ACG) with ancestral sequences a1 and a2.
The problem would be simpler if:
i. The ancestral sequences and their alignment were known.
ii. The alignment of ancestral alignment columns to leaf sequences was known.
How to sum over all possible ancestral sequences and their alignments?
A Markov chain generating ancestral alignments can solve the problem!

One block derivation
$\Delta p_k = \Delta t\,[\lambda (k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu) k\,p_k]$
A block of k links is reached by a birth in a block of k-1 links or by a death in a block of k+1 links, and is left by any birth or death among its k links.

Differential Equations for p-functions
Initial conditions: $p_1(0) = p''_0(0) = 1$; all other $p_k(0)$, $p'_k(0)$ and $p''_k(0)$ are 0, in particular $p'_0(0) = 0$.
$\Delta p_k = \Delta t\,[\lambda (k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu) k\,p_k]$
$\Delta p'_k = \Delta t\,[\lambda (k-1)\,p'_{k-1} + \mu (k+1)\,p'_{k+1} - (\lambda+\mu) k\,p'_k + \mu\,p_{k+1}]$
$\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu (k+1)\,p''_{k+1} - ((k+1)\lambda + k\mu)\,p''_k]$
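One way to check the first differential equation is to integrate it numerically and compare against the closed form $p_k(t) = e^{-\mu t}(1-\lambda\beta)(\lambda\beta)^{k-1}$. The sketch below uses a plain Euler scheme; the truncation at `max_k`, the step count and the parameter values are arbitrary assumptions made for the illustration.

```python
import math


def closed_form_p(k: int, lam: float, mu: float, t: float) -> float:
    x = math.exp((lam - mu) * t)
    beta = (1.0 - x) / (mu - lam * x)
    return math.exp(-mu * t) * (1.0 - lam * beta) * (lam * beta) ** (k - 1)


def euler_p(lam: float, mu: float, t: float, max_k: int = 40, steps: int = 5000):
    """Integrate dp_k/dt with Euler steps; returns a list indexed by k."""
    dt = t / steps
    # p[0] is unused; p[max_k + 1] is held at 0 to truncate the system.
    p = [0.0] * (max_k + 2)
    p[1] = 1.0  # initial condition p_1(0) = 1
    for _ in range(steps):
        nxt = p[:]
        for k in range(1, max_k + 1):
            drift = lam * (k - 1) * p[k - 1] + mu * k * p[k + 1] - (lam + mu) * k * p[k]
            nxt[k] = p[k] + dt * drift
        p = nxt
    return p


if __name__ == "__main__":
    lam, mu, t = 1.0, 2.0, 0.5
    numeric = euler_p(lam, mu, t)
    for k in range(1, 4):
        print(k, numeric[k], closed_form_p(k, lam, mu, t))
```

The equations for p' and p'' can be checked in the same way by adding their extra terms to the update.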

TKF92: Unbreakable fragments
Fragments evolve into fragments. All possible tilings of the sequences with geometric-length fragments are considered.

Long Insertion-Deletions
Can model overlapping indels.
More involved dynamic programming.

The Basics of Footprinting II
Many un-aligned sequences related by a known phylogeny:
Conceptually simple, computationally hard.
Dependent on a single alignment / no measure of uncertainty.
Statistical Alignment:
Explicit stochastic model of substitution and indel evolution.
Advantages: summing over uncertainty + confidence on inference.
Sometimes an HMM: #### #-#- -#-#

Statistical Alignment and Footprinting
An Alignment HMM (A-HMM) over sequences 1…k (e.g. acgtttgaaccgag----) is combined with a Signal HMM (S-HMM).
Comment: the A-HMM * S-HMM is an approximate approach, as the S-HMM does not include an evolutionary model.

“Structure” does not stem from an evolutionary model
Structure HMM with states F and S. Transition probabilities: F to F 0.9, F to S 0.1, S to S 0.9, S to F 0.1.
Each alignment from the Alignment HMM is annotated by the Structure HMM: (A,S).
The equilibrium annotation does not follow a Markov chain.
No ideal way of simulating: using the HMM at the alignment will give other distributions on the leaves, and using the HMM at the root will give other distributions on the leaves.

Markov Chains Generating the p-functions
[Figure: three Markov chains with end state E: the ancestral sequence generator, the p'' function generator (starting from *), and the p'/p function generator over the column states #/#, #/- and -/#.]

The Basic Recursion
"Remove 1st step" recursion and "Remove last step" recursion.
Last/first step removal are inequivalent, but have the same complexity. The first step algorithm is the simplest.

Sequence Recursion: First Step Removal
$P_\alpha(S_k)$: probability of the epifixes (S[k+1:l]), given that the Markov chain starts in state $\alpha$.
$P_\alpha(S_k)$ = E F(kSi,H), where P'(kSi,H =

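Fundamental Pairwise Recursion
Probability of observation: $P(s1, s2) = P(s1)\,P(s1 \to s2)$.
Initial condition: $P(s1_0 \to s2_j) = p''_j\,\pi(s2[1:j])$.
Recursion: $P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + R_{i,j}$, where $R_{i,j}$ collects all cases in which s2[j] belongs to the block of s1[i].
Simplification: $R_{i,j} = \big(p_1 f(s1[i],s2[j]) + p'_1\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1}) + \lambda\beta\,\pi_{s2[j]}\,R_{i,j-1}$.
Substituting $R_{i,j-1} = P(s1_i \to s2_{j-1}) - p'_0\,P(s1_{i-1} \to s2_{j-1})$ gives
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \lambda\beta\,\pi_{s2[j]}\,P(s1_i \to s2_{j-1}) + \big(p_1 f(s1[i],s2[j]) + p'_1\,\pi_{s2[j]} - \lambda\beta\,p'_0\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1})$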
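The cubic recursion and its quadratic simplification can be compared numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a Jukes-Cantor substitution model for f and uniform frequencies for π (neither is specified on the slides), the standard TKF91 forms for p_1, p'_1 and p'_0, and hypothetical function names.

```python
import math

PI = {base: 0.25 for base in "ACGT"}  # assumed uniform equilibrium frequencies


def jc69(a, b, t, rate=1.0):
    """Assumed Jukes-Cantor probability f(a -> b) after time t."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def block_terms(lam, mu, t):
    """Return (lambda*beta, p_1, p'_1, p'_0)."""
    x = math.exp((lam - mu) * t)
    beta = (1.0 - x) / (mu - lam * x)
    lb, survive = lam * beta, math.exp(-mu * t)
    return lb, survive * (1.0 - lb), (1.0 - survive - mu * beta) * (1.0 - lb), mu * beta


def first_row(s2, lb):
    """Initial condition P(s1_0 -> s2_j) = p''_j * pi(s2[1:j])."""
    row = [1.0 - lb]
    for base in s2:
        row.append(row[-1] * lb * PI[base])
    return row


def transition_cubic(s1, s2, lam, mu, t):
    lb, p1, pd1, pd0 = block_terms(lam, mu, t)
    table = [first_row(s2, lb)]
    for i in range(1, len(s1) + 1):
        row = []
        for j in range(len(s2) + 1):
            total = pd0 * table[i - 1][j]  # s1[i] dies without descendants
            tail = 1.0  # weight of the insertions that follow the block head
            for k in range(1, j + 1):
                head = s2[j - k]  # first residue of a block of size k
                block = p1 * jc69(s1[i - 1], head, t) + pd1 * PI[head]
                total += block * tail * table[i - 1][j - k]
                tail *= lb * PI[head]
            row.append(total)
        table.append(row)
    return table[-1][-1]


def transition_quadratic(s1, s2, lam, mu, t):
    lb, p1, pd1, pd0 = block_terms(lam, mu, t)
    prev = first_row(s2, lb)
    for a in s1:
        cur = [pd0 * prev[0]]
        for j, b in enumerate(s2, start=1):
            ins = lb * PI[b]
            diag = p1 * jc69(a, b, t) + pd1 * PI[b] - ins * pd0
            cur.append(pd0 * prev[j] + ins * cur[j - 1] + diag * prev[j - 1])
        prev = cur
    return prev[-1]


def joint(s1, s2, lam, mu, t):
    """P(s1, s2) = P(s1) * P(s1 -> s2), with P(s1) the equilibrium probability."""
    ratio = lam / mu
    p_s1 = (1.0 - ratio) * ratio ** len(s1)
    for base in s1:
        p_s1 *= PI[base]
    return p_s1 * transition_quadratic(s1, s2, lam, mu, t)


if __name__ == "__main__":
    print(transition_cubic("CTAGG", "TTGT", 1.0, 2.0, 0.7))
    print(transition_quadratic("CTAGG", "TTGT", 1.0, 2.0, 0.7))
```

Both functions return the same value up to rounding; the quadratic form only needs the previous row of the table, which is where the speed-up of the simpler recursion comes from.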

Gibbs Samplers for Statistical Alignment Holmes & Bruno (2001): Sampling Ancestors to pairs. Jensen & Hein (in press): Sampling nodes adjacent to triples Slower basic operation, faster mixing

Statistical Alignment, Homology and Linguistics
Robin Ryder, Stephen Clark, Markus Gerstel (2008)
String comparison. String homology.

Refinements to Statistical Alignment
The present model of statistical alignment is very naive. Much is needed for both biological and linguistic applications. Here is a short list.
1. Longer insertion-deletions: ATWYFCAKAC vs A--YFCAKAC
2. Swaps: ATWYFCAKAC vs ATWYCFAKAC
3. Positional heterogeneity / functional annotation / hidden states: ATWYFCAKAC annotated FFFSSFSSSS
4. Long distance correlations
5. Better equilibrium distribution