Using PQ Trees For Comparative Genomics - CPM 2005. Gad M. Landau – Univ. of Haifa; Laxmi Parida – IBM T.J. Watson.

Presentation transcript:

Using PQ Trees For Comparative Genomics
Gad M. Landau – Univ. of Haifa
Laxmi Parida – IBM T.J. Watson
Oren Weimann – Univ. of Haifa

Gene Clusters
• Genes that appear together consistently across genomes are believed to be functionally related; however, their ordering does not have to be the same in every genome.
[Figure: the same gene cluster shown in Genome 1 to Genome 5]

What is a π-pattern? [WABI04]
• Given a string S = s1 s2 s3 ... sn and an integer K, a pattern P = {p1, p2, p3, ..., pm} is a π-pattern if P occurs (possibly permuted) in at least K places in S.
• Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4. P is a 4-π-pattern with location list {1, 5, 10, 11}.
• For the moment we assume that every character appears once in the pattern.
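The definition above can be sketched in a few lines of Python: slide a window of length m over S and report every 1-based start position whose window is a permutation of P. The function name `pi_pattern_locations` is made up for this illustration, and the sketch is a plain O(nm) scan, not the algorithm from the talk.

```python
from collections import Counter

def pi_pattern_locations(s, pattern):
    """Return the 1-based positions where some permutation of `pattern` occurs in `s`."""
    m = len(pattern)
    target = Counter(pattern)  # multiset of the pattern characters
    return [i + 1
            for i in range(len(s) - m + 1)
            if Counter(s[i:i + m]) == target]

# The example from the slide: P = {a, b, c} in S = abcdbacdabacb.
locations = pi_pattern_locations("abcdbacdabacb", "abc")
print(locations)              # [1, 5, 10, 11]
print(len(locations) >= 4)    # True, so P is a 4-pi-pattern
```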

Maximal π-patterns
• Maximal notation: a representation of a maximal π-pattern p that illustrates all the π-patterns that are non-maximal with respect to p.
• Our goal: find all π-patterns p and their maximal notation.
• Our solution: a linear time algorithm based on PQ trees.
• Example: S = a b c d e b a d c e. {a,b} is non-maximal with respect to {a,b,c,d,e}. The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).

PQ trees: Definitions
• PQ trees [Booth, Lueker, 1976]
• Character-labeled leaves.
• P-nodes: represent "truly permuted" components. Arbitrary permutations of the children are allowed.
• Q-nodes: represent bi-connected components. Only reversal of the children order is allowed.
[Figure: example PQ tree with leaves A to K]

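PQ trees: Definitions
• Equivalent PQ trees (≡): two trees are equivalent if one can be obtained from the other by permuting the children of P-nodes and reversing the children of Q-nodes.
[Figure: two equivalent PQ trees over the leaves A to K]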

PQ trees: Definitions
• FRONTIER: the leaves of the tree read from left to right. For the example tree, FRONTIER(T) = "A B C D E F G H I J K"; for an equivalent tree T', FRONTIER(T') = "A B C G H I J K E F D".
• C(T) = the set of frontiers of all trees equivalent to T.
• Theorem: If C(T1) = C(T2) then T1 ≡ T2.
[Figure: example PQ tree with leaves A to K]

Our Use of the PQ tree
• Suppose the π-pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
• Our goal: a tree T with C(T) = { abcd, acbd, dbca, dcba }.
• Write the P-nodes as ',' and the Q-nodes as '-' and get (a-(b,c)-d), which is exactly the maximal notation of the π-pattern {a,b,c,d}.
[Figure: Q-node with children a, P-node (b, c), and d]
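To make the notation and C(T) concrete, here is a small Python sketch. It assumes a PQ tree is encoded as a nested tuple `(kind, children)` with `kind` equal to "P" or "Q" and leaves given as plain strings. That encoding and the function names are choices made for this illustration, not part of the talk. The sketch enumerates frontiers by brute force, so it only suits tiny trees.

```python
from itertools import permutations, product

def notation(node):
    """Render a tree with ',' between P-node children and '-' between Q-node children."""
    if isinstance(node, str):
        return node
    kind, children = node
    separator = "," if kind == "P" else "-"
    return "(" + separator.join(notation(c) for c in children) + ")"

def frontiers(node):
    """Return C(T): the frontiers of all trees equivalent to the given tree."""
    if isinstance(node, str):
        return {node}
    kind, children = node
    if kind == "P":
        orders = permutations(children)                  # any order of the children
    else:
        orders = [tuple(children), tuple(reversed(children))]  # only reversal
    result = set()
    for order in orders:
        for parts in product(*(frontiers(c) for c in order)):
            result.add("".join(parts))
    return result

tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(notation(tree))            # (a-(b,c)-d)
print(sorted(frontiers(tree)))   # ['abcd', 'acbd', 'dbca', 'dcba']
```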

The minimal Consensus PQ tree
• It is not always possible to find a tree T where Π = C(T).
• Consider a π-pattern {a,b,c,d} that appears as Π = { abcd, bdac }. Here { abcd, bdac } ⊂ C(T).
• Given permutations Π = {π1, π2, ..., πk}, the consensus PQ tree T of Π is such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
• The problem of obtaining a maximal notation for a π-pattern is the same as obtaining a minimal consensus PQ tree of all the k occurrences.
• Theorem: The minimal consensus PQ tree T is unique.
[Figure: P-node with children b, c, a, d]

The original use of the PQ Tree
• The consecutive 1's problem. The restriction sets: F = { {a,b,c}, {b,c}, {b,c,d}, {b} }.
• The solution [Booth, Lueker, 1976]: Reduce(F) returns a PQ tree T.
• The result will be C(T), in our case C(T) = { abcd, acbd, dbca, dcba }, and the tree is constructed in O(n^2) time for an n x n matrix.
[Figure: 0/1 matrix with columns a, b, c, d and the resulting tree (a-(b,c)-d)]
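What Reduce(F) has to produce can be checked by brute force on this small example: list every ordering of the columns in which each restriction set is consecutive. This is only an exponential-time sanity check written for this transcript. It is not Booth and Lueker's template-based Reduce, and the function names are hypothetical.

```python
from itertools import permutations

def consecutive(order, subset):
    """True if the members of `subset` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] + 1 == len(positions)

def reduce_bruteforce(elements, restrictions):
    """All orderings of `elements` that keep every restriction set consecutive."""
    return {"".join(p)
            for p in permutations(elements)
            if all(consecutive(p, s) for s in restrictions)}

F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
print(sorted(reduce_bruteforce("abcd", F)))   # ['abcd', 'acbd', 'dbca', 'dcba']
```

The four orderings are exactly the frontiers of the tree (a-(b,c)-d) given on the slide.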

Obtaining the Minimal Consensus PQ tree
• Some definitions [Heber, Stoye, 2001]:
• Common interval: an interval that appears as a consecutive sequence in all the appearances, e.g. [4-8] in the example.
• We denote C_Π = all common intervals = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[5-6] }
• A list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap. For example, p = ([1-2],[2-3]).
• A common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is a reducible interval since it can be generated by the irreducible intervals [1-2],[2-3].
• We denote I_Π = all irreducible intervals of Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }
[Figure: the example permutations π1, π2, π3]
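The definitions above can be tried out with a brute-force Python sketch. The two permutations used here are a made-up 5-element example, not the ones from the slide figure, and the first permutation is assumed to be the identity 1..n. The reducibility test checks only pairs of overlapping common intervals. It relies on the assumption that the union of two non-trivially overlapping common intervals is again a common interval, so a longer chain can always be collapsed into a pair.

```python
def common_intervals(perms):
    """Common intervals [i, j] (i < j) of permutations of 1..n; perms[0] must be 1..n."""
    n = len(perms[0])
    pos = [{value: index for index, value in enumerate(p)} for p in perms]
    found = set()
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            values = range(i, j + 1)
            # [i, j] is common if its values occupy j - i + 1 adjacent positions everywhere
            if all(max(p[v] for v in values) - min(p[v] for v in values) == j - i
                   for p in pos):
                found.add((i, j))
    return found

def irreducible_intervals(common):
    """Common intervals that are not the union of two non-trivially overlapping ones."""
    def reducible(i, j):
        return any((i, b) in common and (a, j) in common
                   for a in range(i + 1, j + 1)
                   for b in range(a, j))
    return {c for c in common if not reducible(*c)}

perms = [(1, 2, 3, 4, 5), (3, 2, 1, 5, 4)]   # hypothetical example
C = common_intervals(perms)
I = irreducible_intervals(C)
print(sorted(C))   # [(1, 2), (1, 3), (1, 5), (2, 3), (4, 5)]
print(sorted(I))   # [(1, 2), (1, 5), (2, 3), (4, 5)]
```

Here [1-3] drops out of I because the chain ([1-2],[2-3]) generates it, mirroring the slide's own example.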

Obtaining the Minimal Consensus PQ tree
• Theorem: Reduce(C_Π) = Reduce(I_Π) = minimal consensus tree.
• The Algorithm:
• Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }.
• Compute Reduce(I_Π) to get the minimal consensus tree of Π.
• The π-pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)
• Time Complexity: For a π-pattern of size n that appears in k places it takes a total of O(kn + n^2) to compute the maximal notation.
[Figure: the example permutations π1, π2, π3]
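The theorem can be sanity-checked by brute force on a tiny case. The sketch below uses a hypothetical pair of permutations of 1..5 with their common and irreducible intervals written out by hand. It confirms that both interval sets admit the same orderings and that the original permutations are among them. The enumeration is exponential and stands in for Reduce for illustration only.

```python
from itertools import permutations

def keeps_interval(order, lo, hi):
    """True if the values lo..hi occupy adjacent positions in `order`."""
    positions = [order.index(v) for v in range(lo, hi + 1)]
    return max(positions) - min(positions) == hi - lo

def reduce_bruteforce(n, intervals):
    """All orderings of 1..n in which every listed interval stays consecutive."""
    return {p for p in permutations(range(1, n + 1))
            if all(keeps_interval(p, lo, hi) for lo, hi in intervals)}

# Hypothetical input: Pi = {(1,2,3,4,5), (3,2,1,5,4)}.
common = [(1, 2), (1, 3), (1, 5), (2, 3), (4, 5)]      # C_Pi, listed by hand
irreducible = [(1, 2), (1, 5), (2, 3), (4, 5)]         # I_Pi, listed by hand

from_common = reduce_bruteforce(5, common)
from_irreducible = reduce_bruteforce(5, irreducible)
print(from_common == from_irreducible)   # True
print(len(from_irreducible))             # 8
```

The 8 orderings correspond to the notation ((1-2-3),(4,5)): two orientations of the Q-node, two orders of the pair 4, 5, and two orders of the two blocks.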

Improving the Time Complexity to O(kn)
• In Heber & Stoye's algorithm for obtaining I_Π, a data structure S was maintained to hold the chains of the irreducible intervals: I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[5-6] }
• REPLACE(S):
• Replace every chain by a Q-node.
• Replace every element that is not a leaf or a Q-node and is pointed to by a vertical link with a P-node.
[Figure: the data structure S for the example permutations π1, π2, π3]

Maximal π-patterns and Sub-Trees
• A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive descendants.
• Suppose the π-pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
• Theorem: If p1 and p2 are π-patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.
[Figure: the tree (a-(b,c)-d)]
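A short Python sketch of the sub-tree rule, using the same nested-tuple encoding of a PQ tree as an illustrative assumption (`(kind, children)` with string leaves). It lists the leaf sets of all sub-trees, reading "any number of consecutive descendants" of a Q-node as two or more consecutive children. That reading is an interpretation, since a single child is already covered by the child's own sub-tree.

```python
def leaves(node):
    """Set of leaf labels below a node."""
    if isinstance(node, str):
        return {node}
    _, children = node
    return set().union(*(leaves(c) for c in children))

def subtree_patterns(node):
    """Leaf sets of all sub-trees: whole P-nodes, or runs of consecutive Q-node children."""
    if isinstance(node, str):
        return set()
    kind, children = node
    found = set()
    if kind == "P":
        found.add(frozenset(leaves(node)))
    else:
        for i in range(len(children)):
            for j in range(i + 2, len(children) + 1):   # runs of length >= 2
                run = children[i:j]
                found.add(frozenset().union(*(leaves(c) for c in run)))
    for child in children:
        found |= subtree_patterns(child)
    return found

tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
for pattern in sorted(subtree_patterns(tree), key=lambda s: (len(s), sorted(s))):
    print(sorted(pattern))
# ['b', 'c']
# ['a', 'b', 'c']
# ['b', 'c', 'd']
# ['a', 'b', 'c', 'd']
```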

So what did we achieve?
• A first algorithm (optimal in time) that generates the maximal notation of a pattern.
• A "bottom-up" construction of a PQ tree.
• A visualization of the inner structure of a pattern.
• Filtering of meaningful from apparently meaningless (non-maximal) clusters.
• Experimental results that show this tool can aid in predicting gene functions.
• Clustering for the various genome models.

Using Our Tool for Various Genome Models
• Genome model I (orthologs only): A sequence is a permutation of the set {1,2,...,n}. There is only one maximal π-pattern, {1,2,...,n}. In O(kn) time we get a PQ tree that describes all patterns of all sizes and their non-maximal relations.

Using Our Tool for Various Genome Models
• Genome model II: A gene may appear once in a sequence or not appear at all in that sequence. We can extend the algorithm to work on sequences that are not permutations of the same set.
• Example: consider 2 sequences over different sets. Add characters as needed, then build the PQ tree on the new sequences. The sub-trees that have no red leaves are all the maximal patterns.
[Figure: the two example sequences, the sequences after adding characters, and the resulting PQ tree with some leaves shown in red]

Using Our Tool for Various Genome Models
• Genome model III (paralogs and orthologs): A gene may appear any number of times in a sequence (including zero). The minimal consensus PQ tree is not necessarily unique.
• Solution:
• Example: consider 2 appearances of the π-pattern {a,a,b} as Π = { aab, baa }:
1. Π = { a1 a2 b, b a2 a1 } gives C(T) = { a1 a2 b, b a2 a1 }
2. Π = { a1 a2 b, b a1 a2 } gives C(T) = { a1 a2 b, b a2 a1, a2 a1 b, b a1 a2 }
[Figure: the two PQ trees, (a1-a2-b) and ((a1,a2),b)]
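The effect of the labeling can be reproduced with a brute-force Python sketch. It relies on the theorem stated earlier in the talk, that reducing by the common intervals yields the minimal consensus tree. So it computes the sets that are consecutive in every occurrence and then enumerates the orderings that keep all of them consecutive. The function names and the token spelling `a1`, `a2` are illustrative assumptions, and the enumeration is exponential.

```python
from itertools import permutations

def is_consecutive(seq, subset):
    """True if the members of `subset` occupy adjacent positions in `seq`."""
    idx = [i for i, x in enumerate(seq) if x in subset]
    return idx[-1] - idx[0] + 1 == len(idx)

def common_sets(occurrences):
    """Sets of two or more tokens that are consecutive in every occurrence."""
    first = occurrences[0]
    n = len(first)
    found = []
    for i in range(n):
        for j in range(i + 2, n + 1):
            candidate = frozenset(first[i:j])
            if all(is_consecutive(o, candidate) for o in occurrences[1:]):
                found.append(candidate)
    return found

def consensus_frontiers(occurrences):
    """C(T) of the minimal consensus tree, enumerated by brute force."""
    sets = common_sets(occurrences)
    return {p for p in permutations(occurrences[0])
            if all(is_consecutive(p, s) for s in sets)}

labeling_1 = [("a1", "a2", "b"), ("b", "a2", "a1")]
labeling_2 = [("a1", "a2", "b"), ("b", "a1", "a2")]
print(len(consensus_frontiers(labeling_1)))   # 2
print(len(consensus_frontiers(labeling_2)))   # 4
```

The first labeling gives the tighter tree, so the choice of labeling for the paralog copies changes the result.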

Using PQ Trees For Comparative Genomics - CPM It