1
Picking Alignments from (Steiner) Trees
Fumei Lam, Marina Alexandersson, Lior Pachter
2
Alignment: Pair Hidden Markov Models and Steiner Networks
[Figure: alignment ATCG--G / A-CGTCA with PHMM states M, X, Y]
Biologically meaningful, fast alignments based on HMM structure
3
Some basic definitions:
Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G′ ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G′ is at most k times the distance between u and v in G.
Let V(G) = R² and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R².
Vertices in the Manhattan network that are not in S are called Steiner points.
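The 1-spanner condition can be checked directly on small examples. The sketch below is only an illustration, not the method from the talk: it assumes integer coordinates, splits each segment into unit edges, and compares breadth-first-search distances with L1 distances. The function names (`build_adjacency`, `network_distance`, `is_manhattan_network`) and the sample inputs are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_adjacency(segments):
    """Split axis-parallel integer segments into unit edges; return an adjacency map."""
    adj = {}
    for (x1, y1), (x2, y2) in segments:
        if x1 == x2:
            pts = [(x1, y) for y in range(min(y1, y2), max(y1, y2) + 1)]
        elif y1 == y2:
            pts = [(x, y1) for x in range(min(x1, x2), max(x1, x2) + 1)]
        else:
            raise ValueError("segments must be horizontal or vertical")
        for a, b in zip(pts, pts[1:]):
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def network_distance(adj, src, dst):
    """Breadth-first search; every edge has length 1, so hops equal path length."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst is not reachable from src


def is_manhattan_network(points, segments):
    """True if every pair in `points` is joined by a path of length equal to its L1 distance."""
    adj = build_adjacency(segments)
    for (ux, uy), (vx, vy) in combinations(points, 2):
        l1 = abs(ux - vx) + abs(uy - vy)
        if network_distance(adj, (ux, uy), (vx, vy)) != l1:
            return False
    return True


# Hypothetical sample data: three points and two candidate networks.
S = [(0, 0), (1, 1), (2, 0)]
good = [((0, 0), (2, 0)), ((1, 0), (1, 1))]                 # (1, 0) acts as a Steiner point
detour = [((0, 0), (0, 1)), ((0, 1), (2, 1)), ((2, 1), (2, 0))]
```

In this example `good` is a Manhattan network for S, while `detour` connects all three points but forces a path of length 4 between (0, 0) and (2, 0), whose L1 distance is 2.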
4
Example: S: red points Manhattan network Steiner point
5
[Gudmundsson-Levcopoulos-Narasimhan 2001]
Find the shortest Manhattan network connecting the points:
4-approximation in O(n³) and 8-approximation in O(n log n)
6
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
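For reference, the Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and keeping the intersections inside the bounding box. A minimal sketch follows; the function name `hanan_grid` and the sample points are hypothetical.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid restricted to the bounding box."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    # Horizontal edges between consecutive x-coordinates on each grid row.
    for y in ys:
        for a, b in zip(xs, xs[1:]):
            edges.append(((a, y), (b, y)))
    # Vertical edges between consecutive y-coordinates on each grid column.
    for x in xs:
        for a, b in zip(ys, ys[1:]):
            edges.append(((x, a), (x, b)))
    return vertices, edges


# Hypothetical sample: three points in general position give a 3 x 3 grid.
vertices, edges = hanan_grid([(0, 0), (2, 1), (1, 3)])
```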
7
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)
A(v) = {u : v is the topmost node below and to the left of u}
[Figure labels: v, slide]
8
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide
The minimum slide arborescence problem (Lingas-Pinter-Rivest-Shamir 1982): O(n³) optimal solution using dynamic programming
9
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness u v b a
10
What is an alignment?
ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC
11
Pair HMMs
Simple sequence-alignment PHMM with states M, X, Y:
M = (mis)match
X = insert seq1
Y = insert seq2
12
Pair HMMs
Hidden state sequence: M M X M Y Y M
Hidden sequence (emitted pairs): A/A T/C C/- G/G -/T -/C G/A
Observed sequences: ATCGG and ACGTCA
Hidden alignment:
ATCG--G
AC-GTCA
Parameters: transition probabilities, output probabilities
13
Using the Pair HMM
In practice, we have the observed sequences ATCGG and ACGTCA, for which we wish to infer the underlying hidden states.
One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
ATCG--G
AC-GTCA
MMXMYYM
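The correspondence between a hidden state path and an alignment can be made concrete with a short sketch. It assumes the state semantics given earlier (M emits one symbol from each sequence, X only from seq1, Y only from seq2); the function name `states_to_alignment` is hypothetical.

```python
def states_to_alignment(states, seq1, seq2):
    """Expand a path over M/X/Y into the two gapped rows of the alignment."""
    row1, row2 = [], []
    i = j = 0
    for s in states:
        if s == "M":      # (mis)match: consume one symbol from each sequence
            row1.append(seq1[i])
            row2.append(seq2[j])
            i, j = i + 1, j + 1
        elif s == "X":    # symbol from seq1 aligned to a gap
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif s == "Y":    # symbol from seq2 aligned to a gap
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("state path does not consume both sequences exactly")
    return "".join(row1), "".join(row2)


rows = states_to_alignment("MMXMYYM", "ATCGG", "ACGTCA")
# rows == ("ATCG--G", "AC-GTCA")
```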
14
Viterbi in PHMM and Needleman-Wunsch
[Figure: PHMM state diagram with states M, X, Y and labeled transitions]
Match prob: p_m, match score: log(p_m)
Mismatch prob: p_r, mismatch score: log(p_r)
Gap prob: p_g, gap score: log(p_g)
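The score assignment above can be tried out with a plain Needleman-Wunsch recursion. The sketch below is a simplification under stated assumptions: it uses one linear gap score log(p_g) per gap column and does not model separate transition probabilities, so it illustrates the scoring correspondence rather than a full PHMM Viterbi. The function name and the parameter values in the example are hypothetical.

```python
import math


def needleman_wunsch(seq1, seq2, p_m, p_r, p_g):
    """Global alignment with log-probability scores; returns (score, state path)."""
    match, mismatch, gap = math.log(p_m), math.log(p_r), math.log(p_g)
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only seq1 consumed (X)
        score[i][0], back[i][0] = i * gap, "X"
    for j in range(1, m + 1):          # first row: only seq2 consumed (Y)
        score[0][j], back[0][j] = j * gap, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j], back[i][j] = max(
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + gap, "X"),
                (score[i][j - 1] + gap, "Y"),
            )
    # Trace back from the bottom-right cell to recover the state path.
    states = []
    i, j = n, m
    while i > 0 or j > 0:
        s = back[i][j]
        states.append(s)
        if s == "M":
            i, j = i - 1, j - 1
        elif s == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))


# Hypothetical parameters, chosen only for illustration.
best_score, best_path = needleman_wunsch("ACGT", "AGT", p_m=0.5, p_r=0.05, p_g=0.1)
```

With these illustrative values the best path aligns A, G and T as matches and places C against a gap.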
15
Want to take into account that the sequences are genomic sequences: Example: a pair of syntenic genomic regions
16
YX PHMM M X Y
17
YX A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight
18
Strategy for Alignment GATTACATTGATCAGACAGGTGAAGA G A T C T T C A T G T A G
19
The CD4 region
[Figure: human vs. mouse, axis ticks 0 and 50000]
20
[Figure: gene structure, 5’ to 3’: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4]
Translation initiation: ATG
Splice site: GGTGAG
Branchpoint: CTGAC
Splice site: CAG
Stop codon: TAG/TGA/TAA
21
Suggests a new Steiner problem Find the shortest 1-spanner connecting reds to blues
22
Generalizes the Manhattan network problem (all points red and blue) Generalizes the Rectilinear Steiner Arborescence problem
23
History of the Rectilinear Steiner Arborescence Problem
1985, Trubin - polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor - error in Trubin
2000, Shi and Su - NP complete!
24
Results for unlabeled problem
An O(n³) 2-approximation algorithm (implemented)
An O(n log n) 4-approximation algorithm
Testing on CD4 region in human/mouse
Implementation (SLIM): http://bio.math.berkeley.edu/slim/
SLIM for SLAM (in progress): http://bio.math.berkeley.edu/slam/
25
[Figure: sequence fragments annotated with state labels X, CNS, Y, M, D, I]
26
The Viterbi graph for a more complicated alignment PHMM
27
Comparison and Analysis of Performance
Our method has two main steps (L = length of seqs, n = #HSP):
1. Building the network: O(n³) or O(n log n)
2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case
Banding algorithms are O(L²) worst case for step 2.
Chaining algorithms are O(n²) in the case where gap penalties can depend on the sequences.
These strategies do not generalize well for more sophisticated HMMs.
28
Summary
[Figure: alignment ATCG--G / A-CGTCA with PHMM states M, X, Y]
Software:
SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/
Thanks: Nick Bray and Simon Cawley