Picking Alignments from (Steiner) Trees. Fumei Lam, Marina Alexandersson, Lior Pachter.



Alignment, Pair Hidden Markov Models, Steiner Networks: biologically meaningful, fast alignments based on HMM structure. [Figure: alignment ATCG--G / A-CGTCA; pair HMM with states M, X, Y]

Some basic definitions: Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G. Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
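The k-spanner definition can be checked directly on a small finite graph. The following is a minimal sketch, not code from the talk: the helper names (`undirected`, `shortest_paths`, `is_k_spanner`) and the tiny unit-square example are assumptions chosen for illustration, and graphs are represented as nested dictionaries of edge weights.

```python
import heapq

def undirected(edges):
    """Build an adjacency dict {u: {v: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def shortest_paths(adj, source):
    """Dijkstra distances from source; unreachable vertices are absent."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_k_spanner(graph, subgraph, terminals, k):
    """True if, for all u, v in terminals, dist in subgraph <= k * dist in graph."""
    inf = float("inf")
    for u in terminals:
        d_graph = shortest_paths(graph, u)
        d_sub = shortest_paths(subgraph, u)
        for v in terminals:
            if d_sub.get(v, inf) > k * d_graph.get(v, inf) + 1e-9:
                return False
    return True

# Hypothetical example: the unit square as G, and an L-shaped path as G'.
square = undirected([((0, 0), (1, 0), 1), ((1, 0), (1, 1), 1),
                     ((0, 0), (0, 1), 1), ((0, 1), (1, 1), 1)])
l_path = undirected([((0, 0), (1, 0), 1), ((1, 0), (1, 1), 1)])
```

In this example the L-shaped path is a 1-spanner for the terminals (0, 0) and (1, 1), and the corner (1, 0) plays the role of a Steiner point because it is in the network but not in S.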

Example: [Figure: the points of S shown in red, a Manhattan network connecting them, and a Steiner point]

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points: a 4-approximation in O(n^3) and an 8-approximation in O(n log n).

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
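For reference, the Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and taking the intersections as vertices. The sketch below builds it for integer or float coordinates; the function name `hanan_grid` and the return format are assumptions made for this illustration, not part of the cited work.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of a set of (x, y) points.

    Vertices are all intersections of the vertical and horizontal lines
    through the input points; edges join neighbouring intersections.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):
                edges.append(((x, y), (xs[i + 1], y)))  # horizontal segment
            if j + 1 < len(ys):
                edges.append(((x, y), (x, ys[j + 1])))  # vertical segment
    return vertices, edges
```

For n points in general position this grid has n^2 vertices, so restricting the search to it turns the geometric problem into a finite graph problem.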

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations). A(v) = {u : v is the topmost node below and to the left of u}. [Figure: a node v and its slide]

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide. The minimum slide arborescence problem (Lingas-Pinter-Rivest-Shamir 1982): O(n^3) optimal solution using dynamic programming.

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness. [Figure: nodes u, v, a, b]

What is an alignment?
ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC

Pair HMMs: simple sequence-alignment PHMM with states M, X, Y. M = (mis)match; X = insert seq1; Y = insert seq2.

MX Y M M Y M Hidden sequence: A A T C C - G G - T - C G A Observed sequence: ATCGG ACGTCA Hidden alignment: ATCG--G AC-GTCA Pair HMMs transition probabilities output probabilities

Using the Pair HMM: In practice, we have observed sequences ATCGG and ACGTCA for which we wish to infer the underlying hidden states. One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm). Result: ATCG--G / AC-GTCA, with hidden states MMXMYYM.
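The relationship between a gapped alignment and its hidden state sequence can be made concrete with a few lines of code. This is a minimal sketch using the state convention from the slides (M = (mis)match, X = residue only in seq1, Y = residue only in seq2); the function name `state_path` is an assumption introduced here.

```python
def state_path(row1, row2):
    """Map the two rows of a gapped pairwise alignment to PHMM states."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    states = []
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if b == "-":
            states.append("X")  # seq1 emits a residue, seq2 has a gap
        elif a == "-":
            states.append("Y")  # seq2 emits a residue, seq1 has a gap
        else:
            states.append("M")  # both emit: match or mismatch
    return "".join(states)

path = state_path("ATCG--G", "AC-GTCA")
```

Applied to the alignment on the slide, this reproduces the state sequence MMXMYYM; removing the gap characters from each row recovers the observed sequences ATCGG and ACGTCA.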

Viterbi in PHMM and Needleman-Wunsch (states M, X, Y). Match prob: p_m; match score: log(p_m). Mismatch prob: p_r; mismatch score: log(p_r). Gap prob: p_g; gap score: log(p_g).
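To illustrate how log-probabilities act as alignment scores, here is a minimal Needleman-Wunsch sketch that takes log(p_m), log(p_r), and log(p_g) as its match, mismatch, and gap scores. It is a simplification under stated assumptions: gaps are scored linearly, state transition probabilities are ignored, and the numeric values of `P_MATCH`, `P_MISMATCH`, and `P_GAP` are invented for the example rather than taken from the talk.

```python
import math

def needleman_wunsch(s, t, match, mismatch, gap):
    """Best global alignment score of s and t with a linear gap score."""
    n, m = len(s), len(t)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of s aligned to gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of t aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # residue of s against a gap
            left = score[i][j - 1] + gap  # residue of t against a gap
            score[i][j] = max(diag, up, left)
    return score[n][m]

# Illustrative (assumed) probabilities, not values from the slides.
P_MATCH, P_MISMATCH, P_GAP = 0.5, 0.1, 0.05
best = needleman_wunsch("ATCGG", "ACGTCA",
                        math.log(P_MATCH), math.log(P_MISMATCH), math.log(P_GAP))
```

Because the scores are logarithms, maximizing the sum of scores is the same as maximizing the product of the per-column probabilities, which is what the Viterbi recursion does in this simplified setting; `math.exp(best)` gives that product.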

Want to take into account that the sequences are genomic sequences: Example: a pair of syntenic genomic regions

YX PHMM M X Y

A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight. [Figure: states Y, X]

Strategy for Alignment GATTACATTGATCAGACAGGTGAAGA G A T C T T C A T G T A G

The CD4 region human mouse

[Figure: gene structure, 5’ to 3’: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4. Labels: Translation Initiation ATG; Splice site GGTGAG; Branchpoint CTG A C; Splice site CAG; Stop codon TAG/TGA/TAA]

Suggests a new Steiner problem: Find the shortest 1-spanner connecting reds to blues.

Generalizes the Manhattan network problem (all points red and blue) Generalizes the Rectilinear Steiner Arborescence problem

History of the Rectilinear Steiner Arborescence Problem:
1985, Trubin: polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor: error in Trubin
2000, Shi and Su: NP-complete!

Results for unlabeled problem: An O(n^3) 2-approximation algorithm (implemented). An O(n log n) 4-approximation algorithm. Testing on CD4 region in human/mouse. Implementation (SLIM). SLIM for SLAM (in progress).

TAAT GTATTG AG GTATTG AG TG AA CTGGTTGGTCCTCAGGTGTGTC ATGTCCACGG G A GT T A C A TC TTGTACACGGCAG T GT A C G CT GG ATGTAAC A C A T G TA X CNS Y M D I

The Viterbi graph for a more complicated alignment PHMM

Comparison and Analysis of Performance. Our method has two main steps (L = length of seqs, n = #HSP): 1. Building the network: O(n^3) or O(n log n). 2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case. Banding algorithms are O(L^2) worst case for step 2. Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences. These strategies do not generalize well for more sophisticated HMMs.

Summary Thanks: Nick Bray and Simon Cawley SLIM (network build): SLAM (alignment): ATCG--G A-CGTCA M X Y Software: