A Large Version of the Small Parsimony Problem

1 A Large Version of the Small Parsimony Problem. Optimally reconstruct ancestral sequences given: an unrooted phylogeny (hence the ‘small’ parsimony problem), a multiple alignment, and an affine gap cost function. Jakob Fredslund* (jakobf@birc.dk), Jotun Hein**, Tejs Scharling*. * Bioinformatics Research Center, Aarhus University, Denmark. ** Department of Statistics, University of Oxford, United Kingdom.

2 Overview: Introduction, Examples, Gap graph construction, Theory, Results, Conclusions

3 Small Parsimony, No Gaps: Algorithm due to Fitch-Hartigan-Sankoff. Going up the tree, calculate N(A), N(C), N(G), N(T) in each node, where N(X) is the minimal cost of the subtree rooted at this node with nucleotide X in the root; then backtrack going down.
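
The up/down passes described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the unrooted tree has been rooted at an arbitrary internal node, uses a unit substitution cost, and handles one alignment column at a time. The names `sankoff`, `tree`, and `leaves` are hypothetical.

```python
from math import inf

NUCS = "ACGT"

def sankoff(tree, leaves, root, cost=lambda x, y: 0 if x == y else 1):
    """Small parsimony for one column (assumed unit cost, arbitrary rooting).

    tree:   dict mapping each internal node to its list of children
    leaves: dict mapping each leaf to its observed nucleotide
    Returns (minimal cost, dict node -> assigned nucleotide).
    """
    table = {}  # node -> {X: minimal cost of subtree with X at this node}

    def up(node):
        if node in leaves:
            # A leaf can only carry its observed nucleotide.
            table[node] = {x: (0 if x == leaves[node] else inf) for x in NUCS}
            return
        for child in tree[node]:
            up(child)
        table[node] = {
            x: sum(min(cost(x, y) + table[c][y] for y in NUCS)
                   for c in tree[node])
            for x in NUCS
        }

    def down(node, x, assignment):
        # Backtrack: give each child the nucleotide that achieved the minimum.
        assignment[node] = x
        for c in tree.get(node, []):
            y = min(NUCS, key=lambda y: cost(x, y) + table[c][y])
            down(c, y, assignment)

    up(root)
    best = min(NUCS, key=lambda x: table[root][x])
    assignment = {}
    down(root, best, assignment)
    return table[root][best], assignment

# Hypothetical four-leaf example.
example_tree = {"root": ["n1", "n2"], "n1": ["a", "b"], "n2": ["c", "d"]}
example_leaves = {"a": "A", "b": "A", "c": "C", "d": "A"}
total, assigned = sankoff(example_tree, example_leaves, "root")
```

With a symmetric cost function the minimal total does not depend on where the root is placed, which is why an arbitrary rooting is acceptable in this sketch.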

4 Small Parsimony, Large Version
1: ac-a---gattc
2: acgac---atcc
3: gc-----gagcc
4: -agacttgt---
5: aagtcttagt-c
g(k) = 12 + 2*k (note: alignment is given)

5 Two Steps: 1) Find the optimal set of indels to explain the gaps. 2) Assign nucleotides optimally (FHS). So: focus on indels.

6 Tracing Evolution: What events could explain this alignment? cagtta gcag--a -cagtta -cag--a -ctg--a

7 Tracing Evolution: cagtta cagtta

8 Tracing Evolution: cagtta caga cagtta cag--a cagtta

9 Tracing Evolution: cagtta caga ctga cagtta cag--a ctg--a caga

10 Tracing Evolution: cagtta caga ctga gcag--a cagtta cag--a ctg--a -cagtta -cag--a -ctg--a gcaga

11 Indels Affect Full Subtrees: cagtta caga ctga gcaga gcag--a -cagtta -cag--a -ctg--a All sequences in the right subtree have gaps in the blue indel’s position.

12 Indels Affect Full Subtrees: cagtta caga ctga gcaga gcag--a -cagtta -cag--a -ctg--a All sequences in the left subtree have gaps in the green indel’s position.

13 Direction of Evolution? cagtta caga ctga gcaga gcag--a -cagtta -cag--a -ctg--a Deletion of tt.

14 Direction of Evolution? cagtta caga ctga gcaga gcag--a -cagtta -cag--a -ctg--a Insertion of tt.

15 Direction of Evolution? cagtta caga ctga gcaga gcag--a -cagtta -cag--a -ctg--a Since we don’t know the direction, we refer to insertions/deletions as indels. And remember: an indel creates gaps in a full subtree.

16 Explaining Gaps With Indels: g(k) = a + bk (anonymous nucleotides denoted by n)

17 Explaining Gaps With Indels: g(k) = a + bk. Cost of this explanation: 2*(a+2b)

18 Explaining Gaps With Indels: g(k) = a + bk. Costs of the two explanations: 2*(a+2b) and 3*(a+b)
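
To make the comparison of the two totals concrete, here is a small sketch. It assumes the totals are read as "two indels of length 2" versus "three indels of length 1" (the figure itself is not in the transcript), and it borrows a = 12, b = 2 from the cost function on slide 4. The function name `gap_cost` is hypothetical.

```python
def gap_cost(k, a=12, b=2):
    """Affine gap cost g(k) = a + b*k for a single indel of length k."""
    return a + b * k

# Assumed reading of the slide: two competing explanations of the same gaps.
two_indels_of_length_2 = 2 * gap_cost(2)    # 2*(a + 2b)
three_indels_of_length_1 = 3 * gap_cost(1)  # 3*(a + b)

cheaper = min(
    ("two indels of length 2", two_indels_of_length_2),
    ("three indels of length 1", three_indels_of_length_1),
    key=lambda pair: pair[1],
)
```

Since 2*(a+2b) - 3*(a+b) = b - a, the explanation with fewer, longer indels is the cheaper one whenever the opening cost a exceeds the extension cost b, as it does for a = 12, b = 2.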

19 Larger Example: N8, N9, N10, N11, N12, N13: ??? Complex problem! (We are not aware of any upper time bound.)

20 Gap Graph Construction: Represent, in a concise way, all gaps and how they are connected: in a graph.

21 Gap Intervals: 1. Find gap intervals.

22 Gap Intervals: 1. Find gap intervals. No optimal indel ‘stops’ in the middle of a gap interval: it is cheaper to extend the indel making the first gap than to open a new one (by the triangle inequality).
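
The transcript does not define a gap interval precisely. The sketch below assumes one plausible reading: a gap interval is a maximal run of adjacent alignment columns in which exactly the same set of sequences has gaps. Under that assumption the intervals can be found in a single left-to-right scan. The function name `gap_intervals` and the return format are hypothetical.

```python
def gap_intervals(alignment):
    """Return (start, end, gapped_rows) triples, with `end` exclusive.

    Assumed definition: a maximal run of columns sharing the same
    non-empty set of rows that contain '-'.
    """
    length = len(alignment[0])
    intervals = []
    start, current = 0, None
    for col in range(length + 1):
        if col < length:
            pattern = frozenset(
                i for i, seq in enumerate(alignment) if seq[col] == "-"
            )
        else:
            pattern = None  # sentinel that closes the last run
        if pattern != current:
            if current:  # only report runs that actually contain gaps
                intervals.append((start, col, sorted(current)))
            start, current = col, pattern
    return intervals

# The four aligned sequences from the 'Tracing Evolution' slides.
rows = ["gcag--a", "-cagtta", "-cag--a", "-ctg--a"]
found = gap_intervals(rows)
```

Here `found` contains one interval for the leading column (rows 1 to 3 gapped) and one for the two-column "tt" region (rows 0, 2, and 3 gapped).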

23 Gap Graph Vertices: 2. Create minimal tree coverings: for each gap interval, find the minimal number of subtrees with gaps in all leaves, covering all gaps.

25 Gap Graph Vertices: Each vertex represents: a) a subtree with gaps in all leaves, b) a region of the alignment. 2. Create minimal tree coverings: for each gap interval, find the minimal number of subtrees with gaps in all leaves, covering all gaps.
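
A minimal sketch of the covering step, under a simplifying assumption: the tree is treated as rooted, so a "subtree" is a node together with all its descendants. In the unrooted setting of the talk, the part of the tree on the other side of an edge would also count as a subtree, which this sketch does not handle. The name `minimal_covering` and the input format are hypothetical.

```python
def minimal_covering(tree, root, gapped):
    """Roots of the maximal subtrees whose leaves are all gapped.

    tree:   dict mapping each internal node to its list of children
    gapped: set of leaves that have gaps in the interval considered
    """
    cover = []

    def visit(node):
        children = tree.get(node, [])
        if not children:            # leaf
            return node in gapped
        full = [visit(c) for c in children]
        if all(full):
            return True             # whole subtree gapped; decide higher up
        # Not fully gapped: the fully gapped children are maximal.
        cover.extend(c for c, is_full in zip(children, full) if is_full)
        return False

    if visit(root):
        cover.append(root)
    return cover

# Hypothetical four-leaf tree.
t = {"r": ["x", "y"], "x": ["s1", "s2"], "y": ["s3", "s4"]}
partial = minimal_covering(t, "r", {"s1", "s2", "s3"})
```

For a rooted tree, taking the maximal fully gapped subtrees gives the smallest number of disjoint subtrees that cover exactly the gapped leaves; in the example, two subtrees (rooted at `x` and at `s3`) are needed.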

32 Gap Graph Connections: 3. Create a connection between vertices v and w if they represent neighboring gaps.

33 Gap Graph Connections: 3. Create a connection between vertices v and w if they represent neighboring gaps. v → w: all of v’s gaps continue in w.

35 Gap Graph Connections: 3. Create a connection between vertices v and w if they represent neighboring gaps. v → w: all of v’s gaps continue in w (a special-case connection exists; see paper).

36 Special Cases: Cousins (0,1) (1,2,3,4) (0,1,2,3) 3. Create a connection between vertices v and w if they represent neighboring gaps. v → w: all of v’s gaps continue in w. v ~ w: some gaps continue.

37 Interpreting a Gap Graph Vertex: A vertex is a potential indel: one indel could have created all gaps in the subtree. Either one indel created all gaps in the subtree (vertex confirmed), ...

38 Interpreting a Gap Graph Vertex: ... or the vertex is decomposed into several indels (further ‘down’ in the tree). Goal: confirm or decompose vertices with respect to the gap cost function.

39 Summary: What Is a Gap Graph? Each vertex represents a subtree in which all nodes have gaps in some region (potential indel).

40 Summary: What Is a Gap Graph? Each vertex represents a subtree in which all nodes have gaps in some region (potential indel). A connection between v and w means gaps continuing through both of their regions:

41 Summary: What Is a Gap Graph? Each vertex represents a subtree in which all nodes have gaps in some region (potential indel). A connection between v and w means gaps continuing through both of their regions: v → w: all of v’s gaps continue in w.

42 Summary: What Is a Gap Graph? Each vertex represents a subtree in which all nodes have gaps in some region (potential indel). A connection between v and w means gaps continuing through both of their regions: v → w: all of v’s gaps continue in w. v ~ w: some gaps continue.
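
The two connection types can be illustrated with a small sketch. It assumes that a vertex is identified by the set of leaves in its subtree, that the two vertices cover adjacent alignment regions, and that "all of v's gaps continue in w" means v's leaf set is contained in w's. These are readings of the summary, not definitions taken from the paper, and the function name `connection` is hypothetical.

```python
def connection(v_leaves, w_leaves):
    """Classify the connection between two vertices of adjacent regions.

    Assumed reading: compare the sets of gapped leaves of v and w.
    Returns 'v -> w', 'w -> v', 'v ~ w', or None if no gap continues.
    """
    v, w = set(v_leaves), set(w_leaves)
    if not v & w:
        return None       # no sequence has gaps in both regions
    if v <= w:
        return "v -> w"   # every gap of v continues in w
    if w <= v:
        return "w -> v"   # every gap of w continues in v
    return "v ~ w"        # only some gaps continue

# Leaf sets taken from the tuples on the 'Special Cases' slide,
# interpreted here (as an assumption) as sets of gapped sequences.
full = connection((0, 1), (0, 1, 2, 3))
part = connection((0, 1), (1, 2, 3, 4))
```

Under this reading, `(0,1)` and `(0,1,2,3)` are joined by a directed connection, while `(0,1)` and `(1,2,3,4)` share only sequence 1 and get the weaker `~` connection.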

43 Theory Needed Here...

44 We Need Optimality Proof: A gap graph may be huge, thus representing an enormous number of potential indels. We need to show two things. P1: all optimal indels are represented in the gap graph. P2: how to ‘resolve the graph’ to determine the set of optimal indels. P1 is proved directly in the paper (Theorem 1).

45 Resolving the Gap Graph: To determine the optimal set of indels, we need to reduce the potentially huge graph while keeping the optimal solution! Theorem 2 and a set of subsequent lemmas serve this purpose by identifying certain local graph configurations that can be reduced. Preprocess the gap graph (perform local reductions) by applying the lemmas.

46 Preprocessing Earlier Example: Iteratively apply lemmas to reduce the graph...

50 Solving Earlier Example: After preprocessing, resolve the remaining graph by checking all combinations (figure label: decompose).

51 Solving Earlier Example: Placing indels in the tree.

52 After Local Preprocessing: In longer examples there will be many undecided vertices (purple) after preprocessing. Find possible decompositions for each vertex and check all combinations in each chain; the number of combinations is exponential in chain length.

53 Execution Times? Worst case: exponential. Average times for random alignments with 60% gaps:

54 60% gaps is a lot...

55 Real Genome Analysis: Nine HIV-1 subtypes from the Los Alamos HIV database (tree constructed with Quicktree): B.ES.89.S61K15, B.FR.83.HXB2, B.GA.88.OYI, B.GB.83.CAM1, B.NL.86.3202A21, B.TW.94.TWCYS, B.US.86.AD87, B.US.84.NY5CG, and B.US.83.SF2. Length: 9868. Running time: 1 sec.

56 Conclusions: Concise way of representing alignment gaps. Theoretically sound framework to prove optimality. Graph reductions lead to fast resolution.

