Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09 The Problem Statistical Alignment - Annotation - Annotation & Statistical Alignment Statistical.

Slides:



Advertisements
Similar presentations
B. Knudsen and J. Hein Department of Genetics and Ecology
Advertisements

Motivation “Nothing in biology makes sense except in the light of evolution” Christian Theodosius Dobzhansky.
Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
Hidden Markov Model in Biological Sequence Analysis – Part 2
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.
Patterns, Profiles, and Multiple Alignment.
درس بیوانفورماتیک December 2013 مدل ‌ مخفی مارکوف و تعمیم ‌ های آن به نام خدا.
Molecular Evolution Revised 29/12/06
Biochemistry and Molecular Genetics Computational Bioscience Program Consortium for Comparative Genomics University of Colorado School of Medicine
Lecture 6, Thursday April 17, 2003
Hidden Markov Models. Two learning scenarios 1.Estimation when the “right answer” is known Examples: GIVEN:a genomic region x = x 1 …x 1,000,000 where.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Bioinformatics and Phylogenetic Analysis
Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Approaches to Sequence Analysis s2s2 s3s3 s4s4 s1s1 statistics GT-CAT GTTGGT GT-CA- CT-CA- Parsimony, similarity, optimisation. Data {GTCAT,GTTGGT,GTCA,CTCA}
Efficient Estimation of Emission Probabilities in profile HMM By Virpi Ahola et al Reviewed By Alok Datar.
Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Probabilistic methods for phylogenetic trees (Part 2)
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Multiple sequence alignment
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
People Graduates Rahul Satija - Footprinting and Statistical Alignment Joanna Davies - Integrative Genomics of Asthma Aziz Mithani - Comparison of Metabolic.
Alignment and classification of time series gene expression in clinical studies Tien-ho Lin, Naftali Kaminski and Ziv Bar-Joseph.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
5 Open Problems in Bioinformatics Pedigrees from Genomes Comparative Genomics of Alternative Splicing Viral Annotation Evolving Turing Patterns Protein.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Cis-regulatory Modules and Module Discovery
Approaches to Sequence Analysis s2s2 s3s3 s4s4 s1s1 statistics GT-CAT GTTGGT GT-CA- CT-CA- Parsimony, similarity, optimisation. Data {GTCAT,GTTGGT,GTCA,CTCA}
EVOLUTIONARY HMMS BAYESIAN APPROACH TO MULTIPLE ALIGNMENT Siva Theja Maguluri CS 598 SS.
Hidden Markov Models in Bioinformatics
Approaches to Sequence Analysis Data {GTCAT,GTTGGT,GTCA,CTCA} GT-CAT GTTGGT GT-CA- CT-CA- s2s2 s3s3 s4s4 s1s1 statistics Parsimony, similarity, optimisation.
What is Bioinformatics?
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,
Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Transcription factor binding motifs (part II) 10/22/07.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Recombination and Pedigrees Genealogies and Recombination: The ARG Recombination Parsimony The ARG and Data Pedigrees: Models and Data Pedigrees & ARGs.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Approaches to Sequence Analysis s2s2 s3s3 s4s4 s1s1 statistics GT-CAT GTTGGT GT-CA- CT-CA- Parsimony, similarity, optimisation. Data {GTCAT,GTTGGT,GTCA,CTCA}
Genome Annotation (protein coding genes)
Videocast August 23rd 2013 from Aarhus to Montreal
Distance based phylogenetics
Hidden Markov Models in Bioinformatics min
Dr Tan Tin Wee Director Bioinformatics Centre
The Most General Markov Substitution Model on an Unrooted Tree
Presentation transcript:

Statistical Alignment and Footprinting Rutgers – DIMACS The Problem Statistical Alignment - Annotation - Annotation & Statistical Alignment Statistical Alignment The Model The Pairwise Algorithm – the HMM connection Multiple sequence alignment algorithms Annotation Ahead Annotation & Alignment The general problem protein secondary structure – protein genes – RNA structure - signal The general algorithm Signals (footprinting) Protein Secondary Structure Prediction Transcription Factor Prediction - Knowledge transfer - homologous/nonhomologous analysis

Sequence Evolution and Annotation Alignment and Footprinting Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 Footprinting -Signals (Blanchette) Observable ACGTC ACG-C AGGCC AGGCT AGG-T AGGTT A-CTC

Thorne-Kishino-Felsenstein (1991) Process (birth rate)  (death rate) A # C G ### # T= 0 T = t # P(s) = (1-  )(  ) l  A  #A *.. *  T #T l =length(s) # # # *

&  into Alignment Blocks A. Amino Acids Ignored: e -  t [1-  ](  ) k-1 # # # k # # # # # k  =[1-e (  )t ]/[  e (  )t ] p k (t) p’ k (t) [1-  -  ](  ) k p’ 0 (t)=  (t) * * # # # # k [1-  ](  ) k p’’ k (t) B. Amino Acids Considered: T R Q S W P t (T-->R)*  Q *..*  W *p 4 (t) 4 T R Q S W  R *  Q *..*  W *p’ 4 (t) 4

Basic Pairwise Recursion (O(length 3 )) i j (i,j) i j i-1 j-1 (i-1,j) (i-1,j-1) survive death (i-1,j-k) ………….. Initial condition: p’’=s2[1:j]

 -globin ( 141) and  - globin (146) (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) : -log(  -globin ) : -log(  -globin -->  -globin) : -log(  -globin,  -globin) = -log(l(sumalign)) *t: /  *t: / s*t: / Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHV DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSD GLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH  - globin  - globin

- # # E # # - E * *  e -   e -   - #  e -   e -    _ #  e -   e -   # -  An HMM Generating Alignments Statistical Alignment Steel and Hein, Holmes and Bruno,2001 C T CAC Emit functions: e(##)=  (N 1 )f(N 1,N 2 ) e(#-)=  (N 1 ), e(-#)=  (N 2 )  (N 1 ) - equilibrium prob. of N f(N 1,N 2 ) - prob. that N 1 evolves into N 2

Why multiple statistical alignment is non-trivial. Steel & Hein, 2001, Hein, 2001, Holmes and Bruno, 2001 * (  ) *###### a s1 s2 s3 *ACG C*TT GT *ACG G # # - 3 # # 4 - # 5 # # 0 # - An HMM generating alignment according to TKF91:

Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin Bean leghemoglobin Probability of data e Probability of data and alignment e Probability of alignment given data * = e Ratio of insertion-deletions to substitutions: Maximum likelihood phylogeny and alignment Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song

Metropolis-Hastings Statistical Alignment. Lunter, Drummond, Miklos, Jensen & Hein, 2005 The alignment moves: We choose a random window in the current alignment ALITL---GG ALLTLTTLGG ---TLTSLGA ALLGLTSLG A TNQHVSCTGN GN-HVSCTGK TNQH-SCTLN TNQHVSCTLN QST--QCC-S S------CCS ---QST--QC ALITL---GG ALLTLTTLGG ---TLTSLGA ALLGLTSLGA TNQHVSCTGN GN-HVSCTGK TNQH-SCTLN TNQHVSCTLN QSTQCCS SCCS QSTQC Then delete all gaps so we get back subsequences Stochastically realign this part ALITL---GG ALLTLTTLGG ---TLTSLGA ALLGLTSLGA TNQHVSCTGN GN-HVSCTGK TNQH-SCTLN TNQHVSCTLN QSTQCCS -S--CCS QSTQC-- The phylogeny moves: As in Drummond et al. 2002

Metropolis-Hastings Statistical Alignment Lunter, Drummond, Miklos, Jensen & Hein, 2005

How to proceed to many many sequences ?? Dynamical Programming stops at 4-5 sequences MCMC stops at 10-13ish sequences Some approximations must be adopted “Temporal Corner cutting” Degenerate Genealogical Structures

Many Sequences: Sequence Graphs Istvan Miklos – Gerton Lunter – Miklos Csuros Investigate a set of ancestral sequences/alignments that are computationally realistic A set of homologous sequences are given ccgttagct With a known phylogeny Pairs of sequences are aligned Graphs defined representing alignment/ancestral sequences Pairs of graphs aligned….

FSA - Fast Statistical Alignment Pachter, Holmes & Co k 1 3 k 2 4 Spanning treeAdditional edges An edge – a pairwise alignment 1 2 Data – k genomes/sequences: 1,3 2,3 3,4 3,k 12 2,k 1,4 4,k Iterative addition of homology statements to shrinking alignment: Add most certain homology statement from pairwise alignment compatible with present multiple alignment i. Conflicting homology statements cannot be added ii. Some scoring on multiple sequence homology statements is used.

Li-Stephens Simplifications relative to the Ancestral Recombination Graph (ARG) Are there intermediates between Spanning Trees and Steiner Trees? Local Trees are Spanning Trees – not phylogenies (Steiner Trees) No non-ancestral bridges between ancestral material

Spannoids – k-restricted Steiner Trees Baudis et al. (2000) Approximating Minimum Spanning Sets in Hypergraphs and Polymatroids Spanning tree Steiner tree Spannoid 2-Spannoid Advantage: Decomposes large trees into small trees Questions: How to find optimal spannoid? How well do they approximate?

Example – Contraction of Simulated Coalescent Trees Simulation Trees simulated from the coalescent Spannoid algorithm: Conclusion Approximation very good for k >5 Not very dependent on sequence number

Annotation & Annotation with alignment Annotation Annotation and alignment Three Programs SAPF – dynamic programming up to 4 sequences BigFoot– MCMCup to 13 sequences GRAPEfoot – pairwise genome footprinting Footprinting

The Basics of Evolutionary Annotation Many aligned sequences related by a known phylogeny n 1 positions sequences k 1 slow - r s fast - r f HMM Unobservable U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 Footprinting -Signals (Churchill and Felsenstein, 96)

Statistical Alignment and Footprinting. sequences k 1 Alignment HMM acgtttgaaccgag---- Signal HMM Alignment HMM sequences k 1 acgtttgaaccgag---- sequences k 1 acgtttgaaccgag---- Alignment HMM Structure HMM (A,S)(A,S)

“Structure” does not stem from an evolutionary model Structure HMM S F 0.1 S F F FF 0.9 F FSFS 0.1 S SS 0.9 S SFSF 0.1 The equilibrium annotation does not follow a Markov Chain: ? F F S S F Each alignment in from the Alignment HMM is annotated by the Structure HMM. using the HMM at the alignment will give other distributions on the leaves No ideal way of simulating: using the HMM at the root will give other distributions on the leaves

An example: Footprinting

Summing Out is Better Satija et al.,2008 True positive rate False positive rate True positive rate False positive rate Simulated data with parameter estimated from Eve Stripe 2. DIS – summing out alignments MPP – fixing on 1 alignment As above but with higher insertion-deletion rate.

Signal Factor Prediction Given set of homologous sequences and set of transcription factors (TFs), find signals and which TFs they bind to. Use PWM and Bruno-Halpern (BH) method to make TF specific evolutionary models Drawback BH only uses rates and equilibrium distribution Superior method: Infer TF Specific Position Specific evolutionary model Drawback: cannot be done without large scale data on TF-signal binding.

Knowledge Transfer and Combining Annotations mousepig human Experimental observations Annotation Transfer Observed Evolution Must be solvable by Bayesian Priors Each position p i probability of being j’th position in k’th TFBS If no experiment, low probability for being in TFBS prior 1 experimentally annotated genome (Mouse)

(Homologous + Non-homologous) detection Wang and Stormo (2003) “Combining phylogenetic data with co-regulated genes to identify regulatory motifs” Bioinformatics Zhou and Wong (2007) Coupling Hidden Markov Models for discovery of cis-regulatory signals in multiple species Annals Statistics gene promotor Unrelated genes - similar expressionRelated genes - similar expression Combine above approaches Combine “profiles”

StatAlign software package Written in Java 1.5 Platform-independent graphical interface Jar file is available, no need to instal Open source, extendable modules

Summary The Problem Statistical Alignment - Annotation - Annotation & Statistical Alignment Statistical Alignment The Model The Pairwise Algorithm – the HMM connection the multiple sequence alignment algorithm Annotation Ahead Annotation & Alignment The general problem protein secondary structure – protein genes – RNA structure - signal The general algorithm Signals (footprinting) Protein Secondary Structure Prediction Transcription Factor Prediction - Knowledge transfer - homologous/nonhomologous analysis

Acknowledgements Footprinting: Rahul Satija, Lior Pachter, Gerton Lunter MCMC: Istvan Miklos, Jens Ledet Jensen, Alex Drummond, Program: Adam Novak, Rune Lyngsø Spannoids: Jesper Nielsen, Christian Storm Earlier Statistical alignment Collaborators Mike Steel, Yun Song, Carsten Wiuf, Bjarne Knudsen, Gustav Wiebling, Christian Storm, Morten Møller, Funding BBSRC MRC Rhodes Foundation Software Next steps

Statistical Aligment and Footprinting Statistical Alignment and Footprinting Although bioinformatics perceived is a new discipline, certain parts have a long history and could be viewed as classical bioinformatics. For example, application of string comparison algorithms to sequence alignment has a history spanning the last three decades, beginning with the pioneering paper by Needleman and Wunch, They used dynamic programming to maximize a similarity score based on a cost of insertion-deletions and a score function on matched amino acids. The principle of choosing solutions by minimizing the amount of evolution is also called parsimony and has been widespread in phylogenetic analysis even if there is no alignment problem. This situation is likely to change significantly in the coming years. After a pioneering paper by Bishop and Thompson (1986) that introduced and approximated likelihood calculation, Thorne, Kishino and Felsenstein (1991) proposed a well defined time reversible Markov model for insertion and deletions (the TKF91-model), that allowed a proper statistical analysis for two sequences. Such an analysis can be used to provide maximum likelihood (pairwise) sequence alignments, or to estimate the evolutionary distance between two sequences. Steel et al. (2001) generalized this to any number of sequences related by a star tree. This was subsequently generalized further to any phylogeny and more practical methods based on MCMC has been developed. We have developed this into a generally available program package. Traditional alignment-based phylogenetic footprinting approaches make predictions on the basis of a single assumed alignment. The predictions are therefore highly sensitive to alignment errors or regions of alignment uncertainty. Alternatively, statistical alignment methods provide a framework for performing phylogenetic analyses by examining a distribution of alignments. We developed a novel algorithm for predicting functional elements by combining statistical alignment and phylogenetic footprinting (SAPF). SAPF simultaneously performs both alignment and annotation by combining phylogenetic footprinting techniques with an hidden Markov model (HMM) transducer-based multiple alignment model, and can analyze sequence data from multiple sequences. We assessed SAPF's predictive performance on two simulated datasets and three well-annotated cis-regulatory modules from newly sequenced Drosophila genomes. The results demonstrate that removing the traditional dependence on a single alignment can significantly augment the predictive performance, especially when there is uncertainty in the alignment of functional regions. The transducer-based version of SAPF is currently able to analyze data from up to five sequences. We are currently developing an MCMC approach that we hope will be capable of analyzing data from species, enabling the user to input sequence data from all 12 recently sequenced Drosophila genomes. We will present initial results from the MCMC version of SAPF and discuss some of the challenges and difficulties affecting the speed of convergence.