Intro to Comp Genomics Lecture 11: Using models for sequence evolution.



Comparing everything
Our intuition: Feature X similar among a group of species -> Feature X is important
Feature X can be:
  Sequence
  Gene expression (human brain vs. chimp brain?)
  Genic structure (exon/intron)
  Protein complexes
  Protein networks
  TF-DNA interaction
Two main difficulties:
  Species have common ancestry: a lot of stuff may be similar just because it has not diverged yet
  Species are related through phylogenetic trees: similarity should follow a tree structure

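Modeling multiple genome sequences
[Figure: Genome 1, Genome 2 and Genome 3 are aligned; statistics over A, C, G, T are collected from the alignment; a Markov process relates the observed sequences (s) to unobserved ancestral sequences.]
Aligned sequences shown in the figure:
  AGCAACAAGTAAGGGAAACTACCCAGAAAA...
  AGCCACATGTAACGGTAATAACGCAGAAAA...
  AGCAACAAGAAAGGGTTACTACGCAGAAAA...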

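Tree models
[Figure: triplet phylogeny over extant species S1, S2, S3 with ancestral species H1, H2.]
For a locus j:
  Extant species: $S^j_1, \dots, S^j_n$
  Ancestral species: $H^j_1, \dots, H^j_{n-1}$
Tree T: parent relation $\mathrm{pa}\,S_i$, $\mathrm{pa}\,H_i$ (in the figure, $\mathrm{pa}\,S_1 = H_1$, $\mathrm{pa}\,S_3 = H_2$; the root is $H_2$)
$\mathrm{Val}(X) = \{A, C, G, T\}$
An evolutionary model = a joint distribution $\Pr(h, s)$
Locus independence: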

Tree models
A tree T over species $S_1, \dots, S_n$, with parent relation $\mathrm{pa}\,S_i$
The Markov assumption is still in effect, but branching complicates it
We need a little more:
The model:
In the triplet:
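The formulas on these slides are not part of the transcript. As a hedged sketch, the standard tree-structured factorization that the text appears to describe can be written as follows. The notation $x_i$ for the state of any node (extant or ancestral) is introduced here for illustration, and $\mathrm{pa}\,S_2 = H_1$ is an assumption, since the slide text only lists the parents of $S_1$ and $S_3$.

```latex
% Locus independence: the joint distribution factorizes over loci j
\Pr(h, s) = \prod_{j} \Pr(h^j, s^j)

% Markov assumption on a tree: each node depends only on its parent
\Pr(h^j, s^j) = \Pr(x^j_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr\!\left(x^j_i \mid x^j_{\mathrm{pa}\,i}\right)

% Triplet with root H_2, pa S_3 = H_2, pa H_1 = H_2, pa S_1 = pa S_2 = H_1
\Pr(h_1, h_2, s_1, s_2, s_3) =
  \Pr(h_2)\,\Pr(h_1 \mid h_2)\,\Pr(s_3 \mid h_2)\,\Pr(s_1 \mid h_1)\,\Pr(s_2 \mid h_1)
```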

Tree models
Toy model: triplet phylogeny
[Figure: triplet phylogeny over S1, S2, S3 with ancestral nodes H1 and H2.]
Substitution probability on all of the branches:
Uniform background probability: $P(x) = 0.25$

Tree models
[Figure: triplet phylogeny over S1, S2, S3 with ancestral nodes H1 and H2.]
Marginal probability of $X_i$ (any r.v.):
Given partial observations s: “ancestral inference”
The total probability of the data: the likelihood of the model given the data

Tree models
[Figure: tree with observed leaves A, C, A and unknown (?) ancestral states.]
Given partial observations s:
The total probability of the data:

Intuition: maximum parsimony
[Figure: tree with leaves A, C, A and unknown (?) ancestors. Ancestors C, C: 2 substitutions. Ancestors A, A: 1 substitution.]
“Parsimony” ~ minimal change
The “small” parsimony problem: find ancestral sequences that minimize the number of substitutions along the tree branches
What is the minimal number of substitutions? (All branches are equal, all substitutions are equal)
The “big” parsimony problem: find the tree topology that gives the minimal parsimony score given a set of loci

Intuition: maximum parsimony
[Figure: triplet tree over S1, S2, S3 with unknown (?) ancestors, annotated with up_set[4] and up_set[5].]
Algorithm (following Fitch 1971):
Up(i):
    if (extant) { up_set[i] = {S_i}; return }
    up(right(i)); up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] is empty) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Down(i):
    down_set[i] = up_set[sibling(i)] ∩ down_set[par(i)]
    if (down_set[i] is empty) {
        down_set[i] = up_set[sibling(i)] ∪ down_set[par(i)]
    }
Algorithm:
    D = 0
    up(root)
    down_set[root] = empty
    down(right(root)); down(left(root))

Intuition: maximum parsimony
[Figure: triplet tree over S1, S2, S3 with unknown (?) ancestors, annotated with down_set[4], down_set[5] and up_set[3].]
Algorithm (following Fitch 1971):
Up(i):
    if (extant) { up_set[i] = {S_i}; return }
    up(right(i)); up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] is empty) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Down(i):
    down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
    if (down_set[i] is empty) { down_set[i] = up_set[i] }
    if (not extant) { down(left(i)); down(right(i)) }
Algorithm:
    D = 0
    up(root)
    down_set[root] = empty
    down(right(root)); down(left(root))
Final assignment: Set[i] = up_set[i] ∩ down_set[i]

Probabilistic inference
[Figure: triplet tree over S1, S2, S3 with unknown (?) ancestors, annotated with up[4] and up[5].]
Algorithm (following Felsenstein 1981):
Up(i):
    if (extant) { up[i][a] = (a == S_i ? 1 : 0); return }
    up(r(i)); up(l(i))
    for each a:
        up[i][a] = Σ_{b,c} Pr(X_l(i) = b | X_i = a) up[l(i)][b] Pr(X_r(i) = c | X_i = a) up[r(i)][c]
Down(i):
    for each a:
        down[i][a] = Σ_{b,c} Pr(X_sib(i) = b | X_par(i) = c) up[sib(i)][b] Pr(X_i = a | X_par(i) = c) down[par(i)][c]
    if (not extant) { down(r(i)); down(l(i)) }
Algorithm:
    up(root)
    L = 0
    for each a {
        L += Pr(root = a) up[root][a]
        down[root][a] = Pr(root = a)
    }
    LL = log(L)
    down(r(root)); down(l(root))

Probabilistic inference
[Figure: triplet tree over S1, S2, S3 with unknown (?) ancestors, annotated with down[4], down[5] and up[3].]
Algorithm (following Felsenstein 1981):
Up(i):
    if (extant) { up[i][a] = (a == S_i ? 1 : 0); return }
    up(r(i)); up(l(i))
    for each a:
        up[i][a] = Σ_{b,c} Pr(X_l(i) = b | X_i = a) up[l(i)][b] Pr(X_r(i) = c | X_i = a) up[r(i)][c]
Down(i):
    for each a:
        down[i][a] = Σ_{b,c} Pr(X_sib(i) = b | X_par(i) = c) up[sib(i)][b] Pr(X_i = a | X_par(i) = c) down[par(i)][c]
    if (not extant) { down(r(i)); down(l(i)) }
Algorithm:
    up(root)
    L = 0
    for each a {
        L += Pr(root = a) up[root][a]
        down[root][a] = Pr(root = a)
    }
    LL = log(L)
    down(r(root)); down(l(root))
Posterior of an ancestral state:
    P(h_i = c | s) = up[i][c] * down[i][c] / (Σ_j up[i][j] * down[i][j])
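As a hedged illustration of the up/down recursion, here is a small Python sketch on the triplet tree. The substitution matrix is an assumption reverse-engineered from the numbers on the message-passing slides (0.96 for no change, 0.02 for A<->G and C<->T, 0.01 otherwise), the tree encoding (leaves as strings, internal nodes as tuples) is hypothetical, and the function names are made up for this example.

```python
import math

ALPHABET = "ACGT"
ROOT_PRIOR = {a: 0.25 for a in ALPHABET}           # uniform background

def subst_prob(parent, child):
    """Assumed toy substitution probability, identical on every branch."""
    if parent == child:
        return 0.96
    if {parent, child} in ({"A", "G"}, {"C", "T"}):  # transitions
        return 0.02
    return 0.01                                      # transversions

def up_pass(node, leaf_state, up):
    """up[node][a] = Pr(data below node | X_node = a)."""
    if isinstance(node, str):
        up[node] = {a: 1.0 if a == leaf_state[node] else 0.0 for a in ALPHABET}
        return
    left, right = node
    up_pass(left, leaf_state, up)
    up_pass(right, leaf_state, up)
    up[node] = {
        a: sum(subst_prob(a, b) * up[left][b] for b in ALPHABET)
           * sum(subst_prob(a, c) * up[right][c] for c in ALPHABET)
        for a in ALPHABET
    }

def down_pass(node, parent, sibling, up, down):
    """down[node][a] = Pr(data outside node's subtree, X_node = a)."""
    down[node] = {
        a: sum(subst_prob(c, a) * down[parent][c]
               * sum(subst_prob(c, b) * up[sibling][b] for b in ALPHABET)
               for c in ALPHABET)
        for a in ALPHABET
    }
    if not isinstance(node, str):
        left, right = node
        down_pass(left, node, right, up, down)
        down_pass(right, node, left, up, down)

def infer(tree, leaf_state):
    up, down = {}, {}
    up_pass(tree, leaf_state, up)
    likelihood = sum(ROOT_PRIOR[a] * up[tree][a] for a in ALPHABET)
    down[tree] = dict(ROOT_PRIOR)
    left, right = tree
    down_pass(left, tree, right, up, down)
    down_pass(right, tree, left, up, down)
    posterior = {}
    for node in up:
        z = sum(up[node][a] * down[node][a] for a in ALPHABET)
        posterior[node] = {a: up[node][a] * down[node][a] / z for a in ALPHABET}
    return math.log(likelihood), up, posterior

tree = (("S1", "S2"), "S3")
log_lik, up, posterior = infer(tree, {"S1": "C", "S2": "C", "S3": "A"})
print(log_lik, posterior[("S1", "S2")])
```

With leaves C and C below H1, the up message at H1 is 0.01^2, 0.96^2, 0.01^2, 0.02^2 over A, C, G, T, which matches the values quoted on the message-passing slide under the assumed matrix.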

Inference as message passing
[Figure: a tree whose observed nodes (s) are labeled DATA; neighboring nodes exchange messages.]
“You are P(H|our data)”
“I am P(H|all data)”

Inference as message passing
[Figure: tree with leaves A, C, C, C (DATA).]
Up: $(0.01)^2, (0.96)^2, (0.01)^2, (0.02)^2$
Down: $(0.25), (0.25), (0.25), (0.25)$
Up: $(0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)$

Learning: Branch decomposition
Can we learn each branch independently?
We know how to compute the ML model given two observed species
We have P(S|D) for each species; can we substitute it for the statistics?
[Figure: table of substitution statistics with rows and columns A, G, C, T.]

Transition posteriors: not independent!
[Figure: tree with leaves A, C, A, C (DATA).]
Down: $(0.25), (0.25), (0.25), (0.25)$
Up: $(0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)$
Up: $(0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)$

Learning: Second attempt
Can we learn each branch independently?
Given $P(S_{\mathrm{pa}\,i} \to S_i \mid D)$ for each species, can we substitute it for the statistics?

Expectation-Maximization
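To make the EM idea concrete, here is a minimal Python sketch for the triplet phylogeny. It is an illustration under several assumptions that are not in the slides: a single transition matrix `P` is shared by all four branches, the root prior is fixed at 0.25, the E-step enumerates the 16 ancestral assignments by brute force instead of using up/down messages, and the toy alignment and function names are invented for the example.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def site_joint(P, s1, s2, s3):
    """Yield (h1, h2, Pr(h1, h2, s1, s2, s3)) for the tree ((S1,S2)H1, S3)H2."""
    for h1, h2 in product(ALPHABET, repeat=2):
        yield h1, h2, 0.25 * P[h2][h1] * P[h2][s3] * P[h1][s1] * P[h1][s2]

def em_step(P, sites):
    """One EM iteration. Returns (updated matrix, log-likelihood under P)."""
    counts = {a: {b: 0.0 for b in ALPHABET} for a in ALPHABET}
    log_lik = 0.0
    for s1, s2, s3 in sites:
        joint = list(site_joint(P, s1, s2, s3))
        total = sum(p for _, _, p in joint)        # Pr(s1, s2, s3)
        log_lik += math.log(total)
        for h1, h2, p in joint:
            weight = p / total                     # posterior of (h1, h2)
            # E-step: expected parent -> child counts on all four branches
            for parent, child in ((h2, h1), (h2, s3), (h1, s1), (h1, s2)):
                counts[parent][child] += weight
    # M-step: normalize each row of expected counts
    new_P = {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }
    return new_P, log_lik

# Hypothetical toy alignment: one string per species, one column per site.
sites = list(zip("ACGTACGGTA", "ACGTACGATA", "ACTTACGGTC"))
P = {a: {b: 0.7 if a == b else 0.1 for b in ALPHABET} for a in ALPHABET}
history = []
for _ in range(20):
    P, log_lik = em_step(P, sites)
    history.append(log_lik)
print(history[0], history[-1])
```

The point of the sketch is that the expected counts use the joint posterior of parent and child on each branch, not the product of the two marginals, and that the log-likelihood recorded in `history` should not decrease from one iteration to the next.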

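Continuous time
Think of time steps that are smaller and smaller
Conditions on transitions:
Theorem:
  $q_i = \lim_{t \to 0} \frac{1 - p_{ii}(t)}{t}$ exists (may be infinite)
  $q_{ij} = \lim_{t \to 0} \frac{p_{ij}(t)}{t}$, $i \neq j$, exists and is finite
(Markov, Kolmogorov)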

Rates and transition probabilities
The process’s rate matrix:
Transition differential equations (backward form):

Matrix exponential
The differential equation:
Series solution:
Summing over different path lengths: 1-path, 2-path, 3-path, 4-path, 5-path
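The equations on these slides are not in the transcript. A hedged reconstruction using the standard textbook forms, with $Q$ for the rate matrix and $P(t)$ for the matrix of transition probabilities (both symbols assumed here), would read:

```latex
% Rate matrix: off-diagonal rates q_{ij} >= 0, rows sum to zero
Q = (q_{ij}), \qquad q_{ii} = -\sum_{j \neq i} q_{ij}

% Kolmogorov backward equation with initial condition P(0) = I
\frac{d}{dt} P(t) = Q\,P(t), \qquad P(0) = I

% Series solution: the k-th term collects the contribution of order k in Qt
P(t) = e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}
     = I + Qt + \frac{(Qt)^2}{2!} + \frac{(Qt)^3}{3!} + \cdots
```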

Computing the matrix exponential

Series methods: just take the first k summands
  reasonable when ||A|| <= 1
  if the terms are converging, you are ok
  can do scaling/squaring:
Eigenvalues/decomposition:
  good when the matrix is symmetric
  problems when having similar eigenvalues
Multiple methods with other types of B (e.g., triangular)
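The series method with scaling and squaring can be sketched in plain Python as below. This is an illustrative implementation, not a production routine: the function names, the choice of 20 series terms and the use of the maximum absolute row sum as the norm are assumptions of this example. The identity used for scaling and squaring is exp(A) = (exp(A / 2^s))^(2^s).

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, terms=20):
    """exp(A) by a truncated Taylor series with scaling and squaring."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)   # max absolute row sum
    # Scale A by 2**squarings so that the scaled norm is at most 1.
    squarings = math.ceil(math.log2(norm)) if norm > 1 else 0
    scale = 2 ** squarings
    scaled = [[x / scale for x in row] for row in A]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = identity
    for k in range(1, terms + 1):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):                          # undo the scaling
        result = mat_mul(result, result)
    return result

# Check against the Jukes-Cantor closed form (rate alpha, time t).
alpha, t = 1.0, 2.0
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = mat_exp([[q * t for q in row] for row in Q])
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * t))
```

In practice a library routine such as `scipy.linalg.expm` would normally be preferred; the sketch only shows why scaling keeps the truncated series well behaved when the norm of the matrix is larger than 1.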

Learning a rate matrix
What if we wish to learn a single rate matrix Q?
Learning is easy for a single, fixed-length branch.
Given (inferred) statistics $n^k$ for multiple branch lengths, we must optimize a nonlinear likelihood function

Learning: Sharing the rate matrix
Use generic optimization methods (e.g., BFGS)

Protein genes: codes and structure
[Figure: gene structure from 5’ UTR to 3’ UTR, with introns/exons and codon positions 1, 2, 3.]
Codons
Introns/exons
Domains
Conformation
Degenerate code
Recombination easier?
Epistasis: fitness correlation between two remote loci

The classical analysis paradigm
Target sequence -> BLAT/BLAST against GenBank -> matching sequences -> CLUSTALW -> alignment -> phylogenetic modeling -> analysis: rate, Ka/Ks…
Alignment shown in the figure:
  ACGTACAGA
  ACGT--CAGA
  ACGTTCAGA
  ACGTACGGA

ClustalW and multiple alignment
ClustalW is the semi-standard multiple alignment algorithm when sequences are relatively diverged and there are many of them
ClustalW:
  Compute pairwise sequence distances (using pairwise alignment): Dist(s1, s2) = best pair align
  Build a guide tree from the distance matrix by neighbor joining, approximating the phylogenetic relations between the sequences
    The guide tree is based on pairwise analysis!
  “Progressive” alignment on the guide tree, from the leaves upwards:
    Align two children given their “profiles”
    Several heuristics for gaps
[Figure: guide tree over S1, S2, S3, S4, S5.]
Other methods are used to multi-align entire genomes, especially when one well-annotated model genome is compared to several similar species. Think of using one genome as a “scaffold” for all other sequences.

Nucleotide substitution models
For nucleotides, fewer parameters are needed:
  Jukes-Cantor (JC)
  Kimura
[Figure: substitution-rate diagrams over A, C, G, T for the two models.]
But this ignores all we know about the properties of amino acids!
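The rate labels in the two diagrams were lost. As a hedged reference, the standard forms of the two models are written below with states ordered A, C, G, T; the symbols $\alpha$ and $\beta$ are assumed here and may differ from the ones used on the slide.

```latex
% Jukes-Cantor: a single rate alpha for every substitution
Q_{\mathrm{JC}} =
\begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}

% Kimura two-parameter: alpha for transitions (A<->G, C<->T),
% beta for transversions
Q_{\mathrm{K2P}} =
\begin{pmatrix}
-(\alpha + 2\beta) & \beta & \alpha & \beta \\
\beta & -(\alpha + 2\beta) & \beta & \alpha \\
\alpha & \beta & -(\alpha + 2\beta) & \beta \\
\beta & \alpha & \beta & -(\alpha + 2\beta)
\end{pmatrix}
```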

Simple phylogenetic modeling: PAM/BLOSUM62
Given a multiple alignment (of protein-coding DNA), we can convert the DNA to proteins. We can then try to model the phylogenetic relations between the proteins using a fixed rate matrix Q, some phylogeny T and branch lengths $t_i$.
When modeling hundreds or thousands of amino acid sequences, we cannot learn from the data the rate matrix (20x20 parameters!) AND the branch lengths AND the phylogeny.
Based on surveys of high-quality aligned proteins, Margaret Dayhoff and colleagues generated the famous PAM (Point Accepted Mutation) matrices: PAM1 is for 1% substitution probability.
Using conserved aligned blocks, Henikoff and Henikoff generated the BLOSUM family of matrices. The Henikoff approach improved the analysis of distantly related proteins. It is based on more sequence (lots of conserved blocks), but it down-weights closely related sequences: BLOSUM62 clusters sequences that are more than 62% identical, so that each cluster counts as a single sequence.

Universal amino-acid substitution rates?
Jordan et al., Nature 2005:
“We compared sets of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions. Cys, Met, His, Ser and Phe accrue in at least 14 taxa, whereas Pro, Ala, Glu and Gly are consistently lost. The same nine amino acids are currently accrued or lost in human proteins, as shown by analysis of non-synonymous single-nucleotide polymorphisms. All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code; conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late. Thus, expansion of initially under-represented amino acids, which began over 3,400 million years ago, apparently continues to this day.”

Your task
Get aligned chromosome 17 for human, chimp, orangutan, rhesus and marmoset.
Use EM on the known phylogeny to estimate a substitution model from the data ($P(x \mid \mathrm{pa}\,x)$).
Partition the genome into two parts according to overall conservation (define the criterion yourself). Then train two models independently and compare them.
Optional: can your models be explained using a single rate matrix and different branch lengths?