An Algorithm for the Consecutive Ones Property Claudio Eccher.


Outline: 1. C1P definition. 2. Biological background: hybridization mapping. 3. An algorithm for the C1P problem: dividing into components, taking care of a component, joining the components together.

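The consecutive ones property. Definition: a binary matrix is said to have the consecutive ones property (C1P) if a permutation of its columns can be found such that all 1s in each row are consecutive. (Figure: an example matrix with its columns in the order A B C D, and the same matrix with its columns permuted to C A D B.)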
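The definition can be checked directly on small matrices. The sketch below is a brute-force illustration only (it tries every column order, so it is exponential); the function names and the 3 x 4 example matrix are assumptions made for this illustration, not data from the slides.

```python
from itertools import permutations

def rows_consecutive(matrix, order):
    """True if, with columns arranged as `order`, every row has its 1s in one block."""
    for row in matrix:
        positions = [pos for pos, col in enumerate(order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

def brute_force_c1p(matrix):
    """Return some column order witnessing the C1P, or None if there is none."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_consecutive(matrix, order):
            return list(order)
    return None

# Hypothetical matrix; columns 0..3 stand for A, B, C, D.
example = [
    [1, 0, 1, 0],  # 1s in A and C
    [0, 1, 0, 1],  # 1s in B and D
    [1, 0, 0, 1],  # 1s in A and D
]
```

For this matrix the original order A B C D fails, while the order C A D B (indices 2, 0, 3, 1) makes every row's 1s consecutive.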

The consecutive ones property. Observation: the C1P is closed under taking submatrices. (Figure: a 'bad' matrix on columns C, A, D.) Whichever column x is put in the middle, there is a row in which x is 0 while the two outer columns are 1. Hence every matrix containing this submatrix is 'bad'.
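A small experiment illustrates the observation. The 3 x 3 matrix below is an assumption: it is one matrix that fits the description (each column is 0 in a row where the other two are 1), since the slide's own figure is not recoverable. The larger matrix is likewise hypothetical and simply contains the small one.

```python
from itertools import permutations

def has_c1p(matrix):
    """Brute-force C1P test: try every column order (illustration only)."""
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        ok = True
        for row in matrix:
            pos = [p for p, c in enumerate(order) if row[c] == 1]
            if pos and pos[-1] - pos[0] + 1 != len(pos):
                ok = False
                break
        if ok:
            return True
    return False

# Assumed 'bad' matrix: whichever column sits in the middle, some row is 1 0 1.
bad = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

# Hypothetical larger matrix whose first three rows and columns reproduce `bad`.
bigger = [
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
```

Both `has_c1p(bad)` and `has_c1p(bigger)` return `False`: deleting rows or columns can never break consecutiveness, so a bad submatrix makes the whole matrix bad.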

Hybridization mapping (1). Copies of a DNA molecule are broken into several fragments (~$10^4$ bases) and replicated by cloning (clones). The possible binding of small sequences (probes) to each clone is checked; the subset of the probes bound (hybridized) to a clone becomes its fingerprint. The clones' overlaps, and thus their relative order, are determined by comparing fingerprints.

Hybridization mapping (2). Two clones sharing part of their respective fingerprints are likely to have come from overlapping DNA regions. (Figure: Clone 1 and Clone 2 drawn over a DNA segment carrying probes A, D, C, B.)

Assumptions: all "clones x probes" hybridization experiments have been done; there are no errors; probes are unique.

Model. There are n clones and m probes. An n x m binary matrix M is built from the experimental data: $M_{ij} = 1$ if probe j hybridized to clone i, and $M_{ij} = 0$ if probe j did not hybridize to clone i.
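As a minimal sketch of the model, the matrix M can be built from one fingerprint per clone. The probe names and the two fingerprints below are hypothetical; only the rule "entry is 1 exactly when probe j hybridized to clone i" comes from the slide.

```python
def build_matrix(fingerprints, probes):
    """Build the n x m 0/1 matrix: one row per clone, one column per probe."""
    column_of = {probe: j for j, probe in enumerate(probes)}
    matrix = []
    for fingerprint in fingerprints:
        row = [0] * len(probes)
        for probe in fingerprint:
            row[column_of[probe]] = 1  # probe hybridized to this clone
        matrix.append(row)
    return matrix

# Hypothetical data: four probes, two clones.
probes = ["A", "B", "C", "D"]
fingerprints = [{"A", "D"}, {"D", "C", "B"}]
M = build_matrix(fingerprints, probes)
```

Here `M` has the rows `[1, 0, 0, 1]` and `[0, 1, 1, 1]`.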

Problem. Determining whether M has the C1P for rows. Finding a permutation of the columns such that all 1s in each row are consecutive. Obtaining a physical map from M.

An algorithm for the C1P problem. The problem belongs to P. The algorithm is from Fulkerson and Gross (1965). Without loss of generality we can assume that all rows are different and that no row is all zeros.

Algorithm sketch: 1. Separation of the rows into components (subsets of rows). 2. Permutation of the columns of each component. 3. Joining of the components together.

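Row relations. Definition: for each row i, $S_i = \{k \mid M_{i,k} = 1\}$ is the set of columns in which row i has a 1. Given two rows i and j, one of the following holds: 1. $S_i \cap S_j = \emptyset$; 2. $S_i \subseteq S_j$ or $S_j \subseteq S_i$; 3. $S_i \cap S_j \neq \emptyset$ and neither set is a subset of the other.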
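The three cases translate directly into set operations. This is a small sketch; the function names and the labels "disjoint", "nested" and "overlap" are chosen here for illustration and are not terms from the slides.

```python
def column_set(row):
    """S_i: the set of column indices where the row has a 1."""
    return {k for k, value in enumerate(row) if value == 1}

def relation(s_i, s_j):
    """Classify two column sets into case 1, 2 or 3 of the row relations."""
    if not (s_i & s_j):
        return "disjoint"   # case 1: empty intersection
    if s_i <= s_j or s_j <= s_i:
        return "nested"     # case 2: one set contains the other
    return "overlap"        # case 3: they intersect, neither contains the other

s1 = column_set([0, 1, 1, 1, 0])   # {1, 2, 3}
s2 = column_set([0, 0, 1, 1, 1])   # {2, 3, 4}
s3 = column_set([0, 0, 1, 0, 0])   # {2}
s4 = column_set([1, 0, 0, 0, 0])   # {0}
```

With these hypothetical rows, `relation(s1, s2)` is `"overlap"`, `relation(s1, s3)` is `"nested"` and `relation(s1, s4)` is `"disjoint"`.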

Dividing in components (1). Let's initially lump together in the same component the rows with non-empty intersection. If there exists a row k such that $S_k \cap S_i = \emptyset$ or $S_k \subseteq S_i$ for all $i \neq k$ in this component, then row k can be put in its own component.

Dividing in components (2). A graph $G_c = (V, E)$ is built from matrix M. Each vertex in V is a row of M. There is an undirected edge in E between $V_i$ and $V_j$ if $S_i \cap S_j \neq \emptyset$ and neither set is a subset of the other. The components we want are the connected components of $G_c$.
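A minimal sketch of this construction follows: build the overlap graph by comparing all pairs of column sets, then collect connected components with an iterative depth-first search. The pairwise comparison is the straightforward way to build the graph and is not meant to match any particular running-time bound; the four column sets are an illustrative input.

```python
def overlap_graph(sets):
    """Adjacency lists of G_c: edge i-j when S_i and S_j overlap without nesting."""
    adjacency = {i: [] for i in range(len(sets))}
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            a, b = sets[i], sets[j]
            if (a & b) and not a <= b and not b <= a:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency

def connected_components(adjacency):
    """Connected components of an undirected graph, each as a sorted list."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components

# Illustrative column sets; the last one is nested in all the others.
sets = [{2, 7, 8}, {5, 2, 7}, {1, 4, 7, 8}, {7}]
components = connected_components(overlap_graph(sets))
```

Rows 0, 1 and 2 overlap pairwise and end up in one component, while row 3 is contained in every other row and therefore forms its own component: `[[0, 1, 2], [3]]`.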

Building $G_c$: an example. (Figure: a matrix with rows $l_1, \dots, l_8$ and columns $c_1, \dots, c_9$, next to the graph $G_c$ on vertices $l_1, \dots, l_8$.) Edge $(l_1, l_2)$ is added.

Building $G_c$: an example (continued). Edge $(l_4, l_5)$ is added.

Building $G_c$: an example (continued). Edge $(l_6, l_7)$ is added.

Building $G_c$: an example (continued). Edge $(l_6, l_8)$ is added.

Taking care of a component (1). (Figure: a component with rows $l_1$, $l_2$, $l_3$ over columns $c_1, \dots, c_8$, and its graph on $l_1$, $l_2$, $l_3$.) The 1s of the first row have to be placed consecutively. The possible solutions can be represented as follows: {2,7,8}, with $l_1$ = …01110…. The second row is adjacent to the first one. Hence, for the second row ($l_2$) there are 2 choices: its 1s can be placed to the left or to the right of those of row $l_1$. The direction does not really matter. Placing $l_2$ to the left gives {5}{2,7}{8}, with $l_1$ = …001110… and $l_2$ = …011100….

Taking care of a component (2). For the third row ($l_3$) we have to consider its relations with the rows connected to $l_3$ by edges. Let's place $l_3$ with respect to $l_2$: we are not free to choose the direction (left or right), because of the relation of $l_3$ with $l_1$. To take the relation between $l_1$ and $l_3$ into account, it is necessary to consider the number of elements in the intersections between $S_1$, $S_2$ and $S_3$.

Taking care of a component (3). Definition: let $x \cdot y = |S_x \cap S_y|$ be the internal product of rows x and y. If $l_1 \cdot l_3 < \min(l_1 \cdot l_2, l_2 \cdot l_3)$, then $l_3$ has to be placed in the same direction in which $l_2$ was placed with respect to $l_1$. If $l_1 \cdot l_3 > \min(l_1 \cdot l_2, l_2 \cdot l_3)$, then $l_3$ has to be placed in the opposite direction to the one in which $l_2$ was placed with respect to $l_1$. If equality holds, it is not possible to make the 1s of $l_3$ consecutive.

Taking care of a component (4). For $l_3$, $S_3 = \{1,4,7,8\}$, $l_1 \cdot l_3 = 2$, $l_1 \cdot l_2 = 2$, $l_2 \cdot l_3 = 1$. Since $2 > \min(2, 1) = 1$, $l_3$ has to be placed in the direction opposite to the one used for $l_2$, that is, to the right of $l_2$: {5}{2}{7}{8}{1,4}. In the column order 5, 2, 7, 8, 1, 4 the rows read $l_1$ = 011100, $l_2$ = 111000, $l_3$ = 001111.

Taking care of a component (5). We had no choice in placing $l_3$. The only choice made was in the placement of $l_2$ with respect to $l_1$, and both possibilities result in the same solutions up to reversal. Therefore, if the component has the C1P, then $l_1$ and $l_3$ must turn out to be properly placed. If, on the contrary, $l_1$ and $l_3$ are not properly placed, then we conclude that the component (and hence the matrix) does not have the C1P.

String generator. We have seen the following examples of string generators: {2,7,8}, {{5}{2,7}{8}}, {{5}{2}{7}{8}{1,4}}. A permutation p of the probes is compatible with a string generator if, whenever A, B, C appear in this order in p and A and C are in a group G, then B is also included in G. An invariant of the algorithm is that, after considering rows 1..k, a permutation p certifies the C1P of the submatrix on rows 1..k iff either p or its reversal is compatible with the string generator.
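A simplified compatibility check can be sketched for flat string generators such as {5}{2}{7}{8}{1,4}. The reading used here is an assumption: the groups are taken in the order written, and columns inside one group may appear in any order. This is stricter than the contiguity condition quoted above, and it does not handle nested generators.

```python
def compatible(perm, generator):
    """True if `perm` lists the groups of `generator` in order, each as one block."""
    position = 0
    for group in generator:
        block = perm[position:position + len(group)]
        if set(block) != set(group):
            return False
        position += len(group)
    return position == len(perm)

def certifies(perm, generator):
    """The invariant's condition: the permutation or its reversal is compatible."""
    return compatible(perm, generator) or compatible(perm[::-1], generator)

generator = [{5}, {2}, {7}, {8}, {1, 4}]
```

Under this reading, `[5, 2, 7, 8, 4, 1]` is compatible, `[4, 1, 8, 7, 2, 5]` passes through its reversal, and `[5, 7, 2, 8, 1, 4]` is rejected.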

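Taking care of a component: a 'bad' component. The relations between the rows are the same as in the preceding component. (Figure: the matrix of the component, with rows $l_1$, $l_2$, $l_3$ over columns $c_1, \dots, c_8$.) After the first two rows the string generator is {5}{2,7}{8,3}, with $l_1$ = …00110… and $l_2$ = …01100…. Placing $l_3$ then gives the column groups {5}{2}{7}{8}{3}{1,4}.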

Taking care of a component (6). For a new row k in the same component, find two previously placed rows i and j such that the edges (k,i) and (i,j) exist in $G_c$, and proceed as in the three-row case. Also check the consistency with the string generator built so far. The algorithm gives all possible permutations of a component having the C1P, up to reversal.

Algorithm implementation. Construct $G_c$ and traverse it using depth-first search. When visiting a vertex, invoke procedure Place. If the column sets are not consistent, then the component does not have the C1P.

Algorithm Place
  input: u, v, w vertices of $G_c = (V, E)$ such that $(u,v) \in E$ and $(v,w) \in E$
  output: a placement for row u, if possible
  if v = nil and w = nil then
    Place all 1s of u consecutively
  else if w = nil then
    Left- or right-place the 1s of u with respect to the 1s of v
    Record direction used
  else if u · w < min(u · v, v · w) then
    Place u with respect to v in the same direction used in the v, w placement
    Record direction used
  else
    Place u with respect to v in the opposite direction used in the v, w placement
    Record direction used
  Check consistency of column sets
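The traversal that drives Place can be sketched as follows. The assumption here is that, when the depth-first search reaches u from v, the third argument w is the vertex from which v itself was reached, so both v and w are already placed; the function name and the small graph are hypothetical, and the placement itself is not implemented.

```python
def place_calls(adjacency, root):
    """List the (u, v, w) argument triples for Place in depth-first order.

    v is the vertex from which u was reached, w the vertex from which v was
    reached; None plays the role of nil.
    """
    calls = []
    visited = set()

    def visit(u, v, w):
        visited.add(u)
        calls.append((u, v, w))
        for neighbour in adjacency[u]:
            if neighbour not in visited:
                visit(neighbour, u, v)

    visit(root, None, None)
    return calls

# Hypothetical component graph: 0 - 1, 1 - 2, 1 - 3.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
calls = place_calls(adjacency, 0)
```

The first call has no placed rows, the second has only v, and every later call has both v and w, which correspond to the three branches of Place.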

Algorithm running time. For an n x m matrix, building graph $G_c$ takes O(nm) time. Checking the consistency of the column sets requires O(m) time per row, and there are n rows to process. Total time is thus O(nm).

Joining components together (1). Construct a new graph $G_M = (V, E)$ in which each component of M is a vertex. For $\alpha, \beta \in V$, there is a directed edge from $\alpha$ to $\beta$ if, for every row $i \in \beta$, the set $S_i$ is contained in at least one set $S_j$ of $\alpha$. $G_M$ tells us how the components of M fit together.
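The edge rule of $G_M$ can be sketched with plain set containment. The four components below are an assumption: their column sets are chosen to be consistent with the string generators shown in the later example slides, and in particular the sets for rows $l_1$ and $l_2$ are a guess, since the example matrix itself is not recoverable.

```python
def component_graph(components):
    """Directed edges (a, b): every set of component b lies inside some set of a."""
    edges = set()
    for a, alpha in enumerate(components):
        for b, beta in enumerate(components):
            if a != b and all(any(s <= t for t in alpha) for s in beta):
                edges.add((a, b))
    return edges

# Assumed column sets, one list per component.
components = [
    [{1, 2, 4, 5, 7, 9}, {2, 3, 4, 5, 6, 7, 8, 9}],  # l1, l2 (guessed)
    [{2, 4, 5, 7, 9}],                               # l3
    [{4, 7}, {7, 2}, {9, 5, 4}],                     # l6, l7, l8
    [{3, 8}, {6, 3}],                                # l4, l5
]
edges = component_graph(components)
```

For these sets the edges are (0,1), (0,2), (0,3) and (1,2): component 0 is the only source, so it would be processed first.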

$G_M$ for the example matrix. (Figure: the example matrix with rows $l_1, \dots, l_8$ and columns $c_1, \dots, c_9$, and the graph $G_M$ on its four components.)

Joining components together (2). For two sets $S_i \in \alpha$, $S_j \in \beta$, if $S_i \subseteq S_j$ then there is no row $k \in \beta$ such that $S_i \not\subseteq S_k$ and $S_i \cap S_k \neq \emptyset$. The exact same containments and disjunctions hold for all other sets from $\alpha$. $G_M$ is acyclic.

Joining components together (3). The joining of components depends on the way sets in one component contain or are contained in sets from other components. Containment is specified by the directed edges in $G_M$. Components having sets not contained anywhere else should be processed first.

Joining components together (4). $G_M$ has to be processed in topological order. Remove all sources from $G_M$ and make the union of their string generators. While $G_M$ is not empty, take the next source $\alpha$, remove $\alpha$ from $G_M$, and refine the current string generator with the string generator of $\alpha$.
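The processing order can be sketched with the standard-library `graphlib` module (Python 3.9 or later). Only the ordering is shown; refining the string generator is not implemented here. The vertex numbers and the edge list are a hypothetical $G_M$ with four components.

```python
from graphlib import TopologicalSorter

def processing_order(vertices, edges):
    """Order the components so that every edge (src, dst) has src before dst."""
    predecessors = {v: set() for v in vertices}
    for src, dst in edges:
        predecessors[dst].add(src)
    # TopologicalSorter expects a mapping node -> nodes that must come first.
    # It raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(predecessors).static_order())

# Hypothetical G_M: component 0 contains 1, 2 and 3; component 1 contains 2.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
order = processing_order(vertices, edges)
```

Several orders are valid for this graph; each of them starts with the source 0 and places 1 before 2.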

Example (1). (Figure: the example matrix with rows $l_1, \dots, l_8$ and columns $c_1, \dots, c_9$, and the graph $G_M$, together with one topological order of its four components.)

Example (2). String generators of the four components: {1}{2,4,5,7,9}{3,6,8} for rows $l_1$ and $l_2$; {2,4,5,7,9} for row $l_3$ (…11111…); {9,5}{4}{7}{2} for rows $l_6$ (…00110…), $l_7$ (…00011…) and $l_8$ (…11100…); {6}{3}{8} for rows $l_4$ (…011…) and $l_5$ (…110…).

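Example (3). After joining the first two components, the string generator {1}{2,4,5,7,9}{3,6,8} covers rows $l_1$, $l_2$ and $l_3$. The next component to process has string generator {9,5}{4}{7}{2}, with rows $l_6$ (…00110…), $l_7$ (…00011…) and $l_8$ (…11100…).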

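Example (4). Refining with {9,5}{4}{7}{2} gives {1}{9,5}{4}{7}{2}{3,6,8}, which covers rows $l_1$, $l_2$, $l_3$, $l_6$, $l_7$ and $l_8$. The next component to process has string generator {6}{3}{8}, with rows $l_4$ (…011…) and $l_5$ (…110…).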

Example (5). Refining with {6}{3}{8} gives the final string generator {1}{9,5}{4}{7}{2}{6}{3}{8}, which covers all rows $l_1, \dots, l_8$. In this particular case there are two solutions, corresponding to the permutation of the identical columns 5 and 9.
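The final string generator can be sanity-checked against the rows of the example. The column sets for $l_4$ to $l_8$ below are read off from the row patterns next to the generators {6}{3}{8} and {9,5}{4}{7}{2}; the sets for $l_1$, $l_2$ and $l_3$ are assumptions chosen to be consistent with {1}{2,4,5,7,9}{3,6,8}, because the original matrix is not recoverable.

```python
def consecutive_in(order, column_set):
    """True if the columns of `column_set` occupy one contiguous block of `order`."""
    positions = sorted(order.index(c) for c in column_set)
    return positions[-1] - positions[0] + 1 == len(positions)

def all_rows_ok(order, rows):
    """True if every row's 1s are consecutive under the column order."""
    return all(consecutive_in(order, s) for s in rows.values())

rows = {
    "l1": {1, 2, 4, 5, 7, 9},        # assumed
    "l2": {2, 3, 4, 5, 6, 7, 8, 9},  # assumed
    "l3": {2, 4, 5, 7, 9},           # assumed
    "l4": {3, 8},
    "l5": {6, 3},
    "l6": {4, 7},
    "l7": {7, 2},
    "l8": {9, 5, 4},
}

solution = [1, 9, 5, 4, 7, 2, 6, 3, 8]
swapped = [1, 5, 9, 4, 7, 2, 6, 3, 8]   # columns 5 and 9 exchanged
```

Both `solution` and `swapped` pass the check, as do their reversals, while the original order 1, 2, ..., 9 does not.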

Algorithm solution is not unique. In general multiple solutions may exist because each component may on its own have several solutions, and each solution can be used in two ways: the permutation and its reversal.

Algorithm running time. Total time for processing each component $c_i$ is $O(n_i m)$. Topological sorting of $G_M$ takes O(n+m) time. If the entries of M are preprocessed, the queries needed for traversing $G_M$ can take constant time; preprocessing takes at most O(nm). The algorithm running time is O(nm).

Concluding remarks (1). Even if a C1P permutation exists, it is not necessarily the true permutation: in general errors do exist, so the true permutation may not be a C1P one, and the solution is not unique.

Concluding remarks (2). Generalizations to account for errors yield NP-hard problems. Relaxing the assumption of unique probes also yields NP-hard problems.

Related works. A considerably more complicated algorithm by Booth and Lueker (1976) exists that takes O(n+m+r) time (r is the total number of 1s). Quite recently a simple O(n+m+r)-time algorithm has been presented by Hsu: J. Algorithms 43 (2002), no. 1, 1-16.