Parametric Inference for Biological Sequence Analysis Lior Pachter and Bernd Sturmfels Mathematics Dept., U.C. Berkeley.


1

2 Parametric Inference for Biological Sequence Analysis Lior Pachter and Bernd Sturmfels Mathematics Dept., U.C. Berkeley

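3 State of the Genomes (Jan 2004) [Figure: genome assemblies by version and size. Versions: v3, v6, v2, v3, v34, v3.1, v0.1, v1, v0, ----. Sizes: 0.36 Gb, 0.35 Gb, 1.7 Gb, 2.5 Gb, 2.9 Gb, 2.8 Gb, 2.4* Gb, 2.9* Gb, 1.2 Gb, 3* Gb, 1.7 Gb. Status legend: Aligned (multiple); Working on it; As soon as released.]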

4 Computational Foundations of Comparative Genomics [Diagram. Tasks: Alignment; Phylogeny; Annotation. Models: Multi HMM; Generalized HMM; General Reversible Model; Generalized Multi HMM; Evol. HMM; Generalized hidden Markov Phylogeny. Software: HANDEL, SLAM, DOUBLESCAN, GENIE, GENSCAN, GENEID, ALL, EXONIPHY, SHADOWER, PHYLIP, PAUP. Common framework: Graphical Models.]

5 Better models: more parameters, less robustness. Example: gene finding. Example: Needleman-Wunsch alignment. 2 parameters: 7 alignments; 4 parameters: 224 alignments.

6 Running Example: HMM ACG

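7 A simple HMM. Hidden states: R and G. Transition probabilities: $s_{RR}$, $s_{RG}$, $s_{GR}$, $s_{GG}$. Output probabilities: $t_R(A) = 1/2$, $t_R(C) = t_R(G) = t_R(T) = 1/6$, $t_G(A) = t_G(C) = t_G(G) = t_G(T) = 1/4$. Initial distribution: $\pi = (\pi_R, \pi_G)$.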

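8 A lattice view [lattice diagram with one row per hidden state, R and G; recoverable letter labels: ATT, AC]. Observed sequence: the nucleotide letters along the lattice. Hidden sequence: RGGGRR.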

9 Algebraic Representation. HMM: [diagram: a chain whose edges are labeled S, S and T, T, T, together with the polynomial map f it defines]

10 Example (binary random variables; diagram edges labeled S, S, T, T, T): The image of f is the zero set of a quartic polynomial [polynomial not preserved in the transcript].

11 Inference. Observed: $\sigma$ = ATTACGAGCA… Questions: 1. What is the probability of the observed sequence? 2. What is the most likely sequence of dice used?

12 Sum-Product Algorithm

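13 Example of an interesting polynomial to evaluate: [formula not preserved in the transcript; the description matches the determinant of an n × n matrix]. The polynomial f has n! terms but can be evaluated efficiently. (+, ×) semiring: Gaussian elimination, $O(n^3)$. (min, +) semiring: Hungarian algorithm, $O(n^3)$.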
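To make the "n! terms versus O(n^3)" contrast concrete, here is a minimal sketch. It assumes the polynomial on the slide is the determinant, which the transcript does not preserve. The function names `det_leibniz`, `det_gauss`, and `tropical_det` are illustrative and do not come from the talk. The tropical version is evaluated by brute force only; the Hungarian algorithm mentioned on the slide would compute the same value in cubic time and is not implemented here.

```python
from fractions import Fraction
from itertools import permutations


def sign(perm):
    """Sign of a permutation, computed by counting inversions."""
    n = len(perm)
    inversions = sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
    return -1 if inversions % 2 else 1


def det_leibniz(M):
    """Expand the determinant as a sum of n! signed monomials."""
    n = len(M)
    total = 0
    for perm in permutations(range(n)):
        term = sign(perm)
        for i in range(n):
            term *= M[i][perm[i]]
        total += term
    return total


def det_gauss(M):
    """Same value via Gaussian elimination, O(n^3) exact arithmetic steps."""
    A = [[Fraction(x) for x in row] for row in M]
    n = len(A)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det  # a row swap flips the sign
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det


def tropical_det(M):
    """(min, +) analogue: replace sums by min and products by sums.

    Brute force over all n! permutations, for illustration only.
    """
    n = len(M)
    return min(sum(M[i][perm[i]] for i in range(n)) for perm in permutations(range(n)))
```

For a small integer matrix the two determinant routines agree exactly, and `tropical_det` returns the cost of a cheapest assignment of rows to columns.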

14 Sum-Product Algorithm: Graph Interpretation [figure labels: A, B]. Questions: 1. What is the probability of the observed sequence? (sum of the weights of all paths) 2. What is the most likely sequence of dice used? (weight of the maximum path)
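Both questions can be answered by one recursion that differs only in the "addition" it uses. The sketch below runs it on the two-state HMM from slide 7. The output probabilities are taken from that slide; the transition probabilities and the initial distribution are not given numerically in the talk, so the values in `TRANS` and `INIT` are assumptions chosen for illustration. The sketch uses the (max, ×) semiring for the second question, which corresponds to (min, +) after taking negative logarithms.

```python
from fractions import Fraction as Fr
from functools import reduce
import operator

STATES = "RG"
# Output probabilities from the slide "A simple HMM".
EMIT = {
    "R": {"A": Fr(1, 2), "C": Fr(1, 6), "G": Fr(1, 6), "T": Fr(1, 6)},
    "G": {c: Fr(1, 4) for c in "ACGT"},
}
# ASSUMPTION: illustrative values, not taken from the talk.
TRANS = {("R", "R"): Fr(9, 10), ("R", "G"): Fr(1, 10),
         ("G", "R"): Fr(1, 5), ("G", "G"): Fr(4, 5)}
INIT = {"R": Fr(1, 2), "G": Fr(1, 2)}


def forward(obs, add, mul=operator.mul):
    """Sum-product recursion over a semiring given by (add, mul)."""
    msg = {s: mul(INIT[s], EMIT[s][obs[0]]) for s in STATES}
    for letter in obs[1:]:
        msg = {
            t: mul(reduce(add, (mul(msg[s], TRANS[s, t]) for s in STATES)),
                   EMIT[t][letter])
            for t in STATES
        }
    return reduce(add, msg.values())


def path_weight(hidden, obs):
    """Weight of one hidden path, used only to check the recursion."""
    w = INIT[hidden[0]] * EMIT[hidden[0]][obs[0]]
    for prev, cur, letter in zip(hidden, hidden[1:], obs[1:]):
        w *= TRANS[prev, cur] * EMIT[cur][letter]
    return w


obs = "ATTACG"
prob = forward(obs, operator.add)  # question 1: sum over all hidden paths
best = forward(obs, max)           # question 2: weight of the best hidden path
```

The recursion touches each lattice edge once, so its cost is linear in the sequence length, while the number of hidden paths it summarizes grows as 2^n.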

15 Parametric Inference. Question: How does $g_\sigma$ depend on U, V? Biological interpretation: How does the annotation, alignment, detected motif, phylogenetic tree, inferred ancestral sequence, … depend on the parameters?

16 Example: Parametric Sequence Alignment. D. Gusfield, K. Balasubramanian, and D. Naor: Parametric optimization of sequence alignment, Algorithmica 12, 1994, 312--326. D. Gusfield and P. Stelling: Parametric and inverse-parametric sequence alignment with XPARAL, Methods in Enzymology 266, 1996, 481--494. M. Waterman, M. Eggert, and E. Lander: Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 1992, 6090--6093.

17

18 Solution to Parametric Inference Proposed answer: a new efficient algorithm called the polytope propagation algorithm... Working in the mathematical setting of tropical geometry...

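19 Newton Polytope of a Polynomial. Definition: The Newton polytope of a polynomial f is the convex hull of the lattice points in $\mathbb{R}^d$ given by the exponent vectors of the monomials of f: $NP(f) = \mathrm{conv}\{\, a \in \mathbb{Z}^d : x^a \text{ appears in } f \text{ with nonzero coefficient} \,\}$.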

20 Example: Newton Polytope. $F(x,y) = 1 + x + x^2 y^3 + x y^4 + x^3 y^2 + xy$. Vertices of the Newton polytope: (0, 0), (1, 0), (3, 2), (1, 4).
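The vertices in this example can be checked mechanically. The sketch below lists the exponent vector of each monomial of F and takes the convex hull with Andrew's monotone chain method. That method is a standard choice made here for illustration and is not something the talk prescribes, and the names `hull` and `EXPONENTS` are hypothetical.

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull(points):
    """Vertices of the convex hull in counter-clockwise order.

    Points that lie inside the hull or in the interior of an edge are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


# Exponent vectors (i, j) of the monomials x^i y^j of
# F(x, y) = 1 + x + x^2 y^3 + x y^4 + x^3 y^2 + x y.
EXPONENTS = [(0, 0), (1, 0), (2, 3), (1, 4), (3, 2), (1, 1)]

vertices = hull(EXPONENTS)
```

Two of the six monomials do not give vertices: (1, 1) lies in the interior, and (2, 3) lies on the edge between (3, 2) and (1, 4).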

21 Newton Polytopes & Inference [figure: the polytope with vertices (0, 0), (1, 4), (3, 2), (1, 0); figure label: Viterbi sequence]

22 Newton Polytopes and Parametric Inference. The linear functionals that attain their maximum over the polytope at a vertex v form the normal cone of v. The collection of all normal cones forms the normal fan of the polytope. Finding the Newton polytope and its normal fan solves the parametric inference problem.
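For a polygon the normal fan is easy to write down: the normal cone of a vertex is spanned by the outward normals of its two incident edges. The sketch below does this for the polytope from the Newton polytope example, with vertices (0, 0), (1, 0), (3, 2), (1, 4). The function names `best_vertex` and `normal_fan` are hypothetical, and the code assumes the vertices are listed counter-clockwise.

```python
# Vertices of the example Newton polytope, in counter-clockwise order.
VERTICES = [(0, 0), (1, 0), (3, 2), (1, 4)]


def best_vertex(w, vertices=VERTICES):
    """Vertex that maximizes the linear functional p -> w . p."""
    return max(vertices, key=lambda p: w[0] * p[0] + w[1] * p[1])


def normal_fan(vertices=VERTICES):
    """Map each vertex to the two rays that span its normal cone.

    For a counter-clockwise polygon, the outward normal of the edge
    p -> q is (dy, -dx) with (dx, dy) = q - p.
    """
    def outward(p, q):
        return (q[1] - p[1], p[0] - q[0])

    n = len(vertices)
    fan = {}
    for i, v in enumerate(vertices):
        prev, nxt = vertices[i - 1], vertices[(i + 1) % n]
        fan[v] = (outward(prev, v), outward(v, nxt))
    return fan


fan = normal_fan()
```

Every parameter vector w inside one cone selects the same vertex, so the optimal explanation changes only when w crosses a ray of the fan. For instance, w = (1, 0) lies between the rays (2, -2) and (2, 2) and selects the vertex (3, 2).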

23

24 Example: Newton Polytope of a CpG island HMM. Sequence of length 8: ATAAGGCG. Equal output probabilities in the non-CpG state. Pyrimidines and purines treated equally.

25 How large can Newton polytopes be? Theorem: Consider graphical models f whose number of parameters d is fixed and whose number of observed random variables n and number of edges e vary. Then the number of vertices of the Newton polytope of $f_\sigma$ is bounded above by $\#\mathrm{vertices}(NP(f_\sigma)) \le C \cdot e^{d(d-1)/(d+1)}$, where e denotes the number of edges. Proof (Andrews 1964): For every fixed integer d there exists a constant $C_d$ such that the number of vertices of a lattice polytope P in $\mathbb{R}^d$ is bounded above by $C_d \cdot \mathrm{Vol}(P)^{(d-1)/(d+1)}$.
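The slide does not spell out how the volume bound yields the exponent in the theorem. One way to connect the two is sketched here. It assumes (this is not stated in the transcript) that every parameter occurs with degree at most e in each monomial of $f_\sigma$, so that the Newton polytope fits inside a cube of side length e, and that the polytope is full-dimensional.

```latex
\begin{align*}
NP(f_\sigma) \subseteq [0, e]^d
  &\implies \mathrm{Vol}\bigl(NP(f_\sigma)\bigr) \le e^{d}, \\
\#\mathrm{vertices}\bigl(NP(f_\sigma)\bigr)
  &\le C_d \cdot \mathrm{Vol}\bigl(NP(f_\sigma)\bigr)^{(d-1)/(d+1)}
   \le C_d \cdot e^{\,d(d-1)/(d+1)}.
\end{align*}
```

For d = 2 parameters the exponent is 2/3, so under these assumptions the number of vertices grows more slowly than the number of edges.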

26 Polytope Propagation Algorithm. + : convex hulls of unions of polytopes. × : Minkowski sums of polytopes.
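The two operations can be substituted into a forward recursion on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation. It uses a hypothetical two-parameter scoring of a two-state HMM in which each hidden path contributes the exponent vector (number of state changes, number of positions where state R sees the letter A or state G sees another letter). The names `padd`, `pmul`, `propagate`, and `brute_force` are made up for this example.

```python
from itertools import product


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull(points):
    """Convex hull vertices (monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def padd(P, Q):
    """'+' on polytopes: convex hull of the union."""
    return hull(P + Q)


def pmul(P, Q):
    """'x' on polytopes: Minkowski sum."""
    return hull([(p[0] + q[0], p[1] + q[1]) for p in P for q in Q])


def emit(state, letter):
    # HYPOTHETICAL scoring: one unit in the second coordinate on agreement.
    return [(0, 1)] if (state == "R") == (letter == "A") else [(0, 0)]


def trans(s, t):
    # HYPOTHETICAL scoring: one unit in the first coordinate per state change.
    return [(1, 0)] if s != t else [(0, 0)]


def propagate(obs):
    """Forward recursion with numbers replaced by polytopes."""
    P = {s: emit(s, obs[0]) for s in "RG"}
    for letter in obs[1:]:
        new = {}
        for t in "RG":
            acc = []  # the empty polytope is the identity for padd
            for s in "RG":
                acc = padd(acc, pmul(P[s], trans(s, t)))
            new[t] = pmul(acc, emit(t, letter))
        P = new
    return padd(P["R"], P["G"])


def brute_force(obs):
    """Same polytope from all 2^n hidden paths, for checking only."""
    pts = []
    for h in product("RG", repeat=len(obs)):
        switches = sum(a != b for a, b in zip(h, h[1:]))
        agree = sum((s == "R") == (c == "A") for s, c in zip(h, obs))
        pts.append((switches, agree))
    return hull(pts)
```

`propagate` never enumerates hidden paths, yet it returns the same vertex set as `brute_force`, because convex hull of unions and Minkowski sum satisfy the distributive law that the sum-product recursion relies on.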

27 Polytope Propagation Algorithm [figure; recoverable labels: 0101101011]

28 Complexity of polytope propagation. Proportional to the running time of the sum-product algorithm. Convex hull and Minkowski sum computations depend only on the size of the Newton polytopes. The convex hulls computed are those of unions of polytopes. Example: 2-parameter alignment, O(nm|N(P)|).

29 Alignment with 4 parameters. Sequences: 1: AGGACCGATTACAGTTCAA 2: TTCCTAGGTTAAACCTCATGCA. Parameters: Match, Mismatch, GapOpen, GapExtend. POLYMAKE -- www.math.tu-berlin.de/polymake/

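30 Example: Naïve Bayes Model. $f: \mathbb{R}^{12} \to \mathbb{R}^{9}$, $f_{ij} = s_{i0} t_{0j} + s_{i1} t_{1j}$. Tropicalization: $g_{ij} = \min(u_{i0} + v_{0j},\; u_{i1} + v_{1j})$. [Not recoverable from the transcript: the generators of the ideal $I_f$, the symbols of two sets derived from $I_f$, and the context of the number 6.]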
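The passage from $f_{ij}$ to $g_{ij}$ can be checked numerically. The sketch below assumes the usual convention $s = e^{-u/\varepsilon}$ and $t = e^{-v/\varepsilon}$, which the slide does not state explicitly, and shows that $-\varepsilon \log f_{ij}$ approaches $g_{ij}$ as $\varepsilon$ shrinks. The matrices `U` and `V` are made-up values, and the function names are hypothetical.

```python
import math


def f(s, t, i, j):
    """Naive Bayes coordinate f_ij = s_i0 * t_0j + s_i1 * t_1j."""
    return s[i][0] * t[0][j] + s[i][1] * t[1][j]


def g(u, v, i, j):
    """Tropicalization of f_ij: + becomes min, * becomes +."""
    return min(u[i][0] + v[0][j], u[i][1] + v[1][j])


def scaled_log(u, v, i, j, eps):
    """-eps * log f_ij evaluated at s = exp(-u/eps), t = exp(-v/eps)."""
    s = [[math.exp(-x / eps) for x in row] for row in u]
    t = [[math.exp(-x / eps) for x in row] for row in v]
    return -eps * math.log(f(s, t, i, j))


# ASSUMPTION: arbitrary illustrative values for the 12 parameters.
U = [[0.5, 2.0], [1.0, 0.3], [1.5, 1.5]]  # u_i0, u_i1 for i = 0, 1, 2
V = [[0.2, 1.0, 0.7], [0.4, 0.1, 2.0]]    # v_0j and v_1j for j = 0, 1, 2

EPS = 0.05
gaps = [g(U, V, i, j) - scaled_log(U, V, i, j, EPS)
        for i in range(3) for j in range(3)]
```

Each gap lies between 0 and ε log 2, because a sum of two positive terms is at least the larger term and at most twice the larger term.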

31 Summary: Comparative Genomics, Graphical Models and the Geometry -- Statistics Dictionary. Computational foundations: graphical models; sum-product algorithm (inference); polytope propagation (parametric inference). Mathematical foundations: algebraic geometry; algebraic varieties (the model); amoebas (the model in log probabilities); tropical varieties (parametric MAP inference).

