Fingerprint Clustering - CPM 2006: Fingerprint Clustering with Bounded Number of Missing Values. Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri.


Fingerprint Clustering with Bounded Number of Missing Values
Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri (Università di Milano-Bicocca, Italy)
Riccardo Dondi (Università di Bergamo, Italy)

Talk Outline
Biological problem and combinatorial problem
Three versions of the problem:
–Clustering with Missing Values (CMV)
–Inside Edge Clustering (IEC)
–Outside Edge Clustering (OEC)
Approximation algorithm for IEC and OEC
Polynomial time algorithm for restricted CMV
APX-hardness of CMV
APX-hardness of IEC and OEC
Future work

Biological Motivations
Classification of microorganisms:
A library of rDNA (ribosomal RNA gene) clones is created
A short DNA sequence (a probe) is applied to hybridize with all clones of the library
After hybridization, unbound probes are removed; the library is analyzed to see how strongly the probe has hybridized to each spot
The experiment is repeated for a set of probes

Biological Motivations
Fingerprint of a clone: vector consisting of the hybridization intensity values between the clone and each probe
To classify microorganisms:
Fingerprints are transformed into binary vectors
Fingerprints are clustered to infer different properties with respect to the probes

Biological Motivations
Goal: translate hybridization intensity values into binary values 0, 1
The intensity values do not always allow a clear binary call, so it is not always possible to get binary vectors
For each clone we are given a fingerprint over the alphabet {0,1,N}:
0 → no hybridization
1 → hybridization
N → unable to determine whether a hybridization has happened

Clustering of fingerprints – Combinatorial problem
Two fingerprints are compatible iff they agree in each position where both are different from N
Example:
Two compatible fingerprints: N N N N
Two incompatible fingerprints: N N N N
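The compatibility test can be sketched in a few lines of Python. This is only an illustration: fingerprints are assumed to be equal-length strings over {0,1,N}, the function name `compatible` is hypothetical, and the sample strings are borrowed from the clustering example given elsewhere in this transcript.

```python
def compatible(a: str, b: str) -> bool:
    """Return True if a and b agree in every position where neither is 'N'."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return all(x == y or x == "N" or y == "N" for x, y in zip(a, b))


print(compatible("010N", "0N01"))  # True: no position holds two different bits
print(compatible("0N01", "N100"))  # False: the last position has 1 against 0
```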

Clustering of fingerprints – Combinatorial problem
Clustering of fingerprints: general formulation
Input: a set F of fingerprints
Output: a clustering (partition) C of F such that each cluster of C contains only pairwise compatible fingerprints

Clustering of fingerprints – Combinatorial problem
An example
F: f1 = 010N, f2 = 0N01, f3 = N100, f4 = 1NN1
Compatibility: f1 and f2; f1 and f3
Some possible solutions:
–(f1 = 010N, f2 = 0N01), (f3 = N100), (f4 = 1NN1)
–(f1 = 010N, f3 = N100), (f2 = 0N01), (f4 = 1NN1)
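As a rough check of this example, the sketch below tests whether a proposed clustering is feasible, that is, whether it partitions F and every cluster holds only pairwise compatible fingerprints. The helper names are hypothetical, and fingerprints are assumed to be distinct strings.

```python
from itertools import combinations


def compatible(a, b):
    """Fingerprints agree wherever neither of them is 'N'."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def is_valid_clustering(fingerprints, clusters):
    """Check that clusters partition the input and are pairwise compatible."""
    members = [f for cluster in clusters for f in cluster]
    if sorted(members) != sorted(fingerprints):
        return False  # a fingerprint is missing or appears twice
    return all(
        compatible(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )


F = ["010N", "0N01", "N100", "1NN1"]
print(is_valid_clustering(F, [["010N", "0N01"], ["N100"], ["1NN1"]]))  # True
print(is_valid_clustering(F, [["010N", "N100"], ["0N01"], ["1NN1"]]))  # True
print(is_valid_clustering(F, [["0N01", "N100"], ["010N"], ["1NN1"]]))  # False
```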

Clustering of fingerprints – Three versions of the problem
Three combinatorial versions of the problem with different objective functions:
CMV (Clustering with Missing Values): minimize the number of clusters
IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints
OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters

CMV – An example
CMV: minimize the number of clusters
F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
–A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 3
–Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2
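For an instance this small, the optimum can be confirmed by exhaustive search. The sketch below enumerates every partition of F and keeps a feasible one with the fewest clusters. It is a brute-force illustration with hypothetical helper names, not the method from the talk, and it is exponential in the number of fingerprints.

```python
from itertools import combinations


def compatible(a, b):
    """Fingerprints agree wherever neither of them is 'N'."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def partitions(items):
    """Yield every partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def min_clusters(fingerprints):
    """Return a feasible clustering with the fewest clusters (brute force)."""
    feasible = (
        p for p in partitions(fingerprints)
        if all(compatible(a, b) for block in p for a, b in combinations(block, 2))
    )
    return min(feasible, key=len)


F = ["01NN", "0NN1", "0N00", "00N1"]
best = min_clusters(F)
print(len(best))  # 2
```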

IEC – An example
IEC: maximize the number of co-clustered pairs
F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
–A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 1: pair (f1,f2) co-clustered
–Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2: pairs (f1,f3) and (f2,f4) co-clustered

OEC – An example
OEC: minimize the number of compatible pairs that are not co-clustered
F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
–A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 2: pairs (f1,f3) and (f2,f4) not co-clustered
–Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 1: pair (f1,f2) not co-clustered
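The IEC and OEC values quoted in these examples can be recomputed with the sketch below. The function names are hypothetical and the clusterings are assumed to be feasible, so every co-clustered pair is compatible. Under that assumption the two scores always add up to the total number of compatible pairs (3 in this instance), which is why maximizing one is the same as minimizing the other at the optimum.

```python
from itertools import combinations


def compatible(a, b):
    """Fingerprints agree wherever neither of them is 'N'."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def iec_score(clusters):
    """Number of co-clustered pairs (to be maximized)."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


def oec_score(fingerprints, clusters):
    """Number of compatible pairs split across clusters (to be minimized)."""
    cluster_of = {f: i for i, c in enumerate(clusters) for f in c}
    return sum(
        1
        for a, b in combinations(fingerprints, 2)
        if compatible(a, b) and cluster_of[a] != cluster_of[b]
    )


F = ["01NN", "0NN1", "0N00", "00N1"]
solution = [["01NN", "0NN1"], ["0N00"], ["00N1"]]
optimum = [["01NN", "0N00"], ["0NN1", "00N1"]]

print(iec_score(solution), oec_score(F, solution))  # 1 2
print(iec_score(optimum), oec_score(F, optimum))    # 2 1
```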

Parameterized versions
We consider parameterized versions of the problem, where the parameter p bounds the number of N's in each fingerprint
CMV(p), IEC(p), OEC(p): fingerprints have at most p positions with value N

Parameterized versions
Resolution of a fingerprint f: a vector over {0,1} that is compatible with f
Example: f = 01NN10
Possible resolutions: 010010, 010110, 011010, 011110

Parameterized versions
For each fingerprint with p N's: 2^p possible resolutions
Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S in order to optimize the objective function
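Enumerating the resolutions of a fingerprint amounts to filling every N with each of the two bits. A minimal sketch, assuming fingerprints are strings over {0,1,N} (the function name `resolutions` is hypothetical):

```python
from itertools import product


def resolutions(fingerprint):
    """Return all binary vectors obtained by replacing each 'N' with 0 or 1."""
    slots = [i for i, c in enumerate(fingerprint) if c == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        vector = list(fingerprint)
        for position, bit in zip(slots, bits):
            vector[position] = bit
        result.append("".join(vector))
    return result


print(resolutions("01NN10"))  # ['010010', '010110', '011010', '011110']
```

A fingerprint with p N's yields 2^p strings, so the set S of resolved vectors has at most n · 2^p elements for n fingerprints.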

Previous results
CMV(p):
–NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]
–Poly-time for p = 1 [Figueroa et al., J of Comp. Biology 2004]
–Approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]
IEC(p):
–Approximation algorithm with factor 2^(2p−1) [Figueroa et al., CATS 2005] for any p = O(log n)
OEC(p):
–Approximation algorithm with factor 2(1-1/2p) for restricted instances [Figueroa et al., CATS 2005]

Approximation algorithm for OEC(p) and IEC(p)
Greedy Algorithm:
WHILE (there exists an unassigned fingerprint)
1. select a resolved vector that resolves the maximum number of unassigned fingerprints and assign those fingerprints to it
2. delete the assigned fingerprints
ENDWHILE
Approximation ratio 2 for OEC
Approximation ratio ½ for IEC
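The greedy loop can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: candidate resolved vectors are generated by enumerating all resolutions of the unassigned fingerprints, ties are broken by whichever candidate is generated first, and the names `resolutions` and `greedy_clustering` are hypothetical.

```python
from itertools import product


def resolutions(fingerprint):
    """All binary vectors obtained by replacing each 'N' with 0 or 1."""
    slots = [i for i, c in enumerate(fingerprint) if c == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        vector = list(fingerprint)
        for position, bit in zip(slots, bits):
            vector[position] = bit
        result.append("".join(vector))
    return result


def greedy_clustering(fingerprints):
    """Repeatedly pick the resolved vector covering most unassigned fingerprints."""
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        # Map each candidate resolved vector to the unassigned fingerprints it resolves.
        coverage = {}
        for i in unassigned:
            for r in resolutions(fingerprints[i]):
                coverage.setdefault(r, []).append(i)
        best = max(coverage, key=lambda r: len(coverage[r]))
        chosen = set(coverage[best])
        clusters.append((best, [fingerprints[i] for i in coverage[best]]))
        # Delete the assigned fingerprints before the next round.
        unassigned = [i for i in unassigned if i not in chosen]
    return clusters


for vector, members in greedy_clustering(["01NN", "0NN1", "0N00", "00N1"]):
    print(vector, members)
```

Each round costs time proportional to the number of unassigned fingerprints times 2^p, so the sketch is only practical when p is small.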

A tight example for IEC
f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N
f1 compatible with f2, f2 compatible with f3, f3 compatible with f4
Resolved vectors associated with the compatible pairs: r12 = 0001; r23 = 0101; r34 = 0111
Each of these resolved vectors resolves two fingerprints

A tight example for IEC
The algorithm chooses one resolved vector, for example r23; f2 and f3 are assigned to r23 and deleted
r12 is chosen; f1 is assigned to it and deleted
r34 is chosen; f4 is assigned to it and deleted
Number of compatible co-clustered pairs: 1
The optimal solution consists of:
r12, with f1 and f2 assigned to it
r34, with f3 and f4 assigned to it
Number of compatible co-clustered pairs in the optimal solution: 2
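The two outcomes can be replayed with a short sketch that assigns fingerprints to resolved vectors in a fixed order. Two assumptions are made here: the order of choices is supplied by hand to mimic the unlucky tie-break, and r23 is taken to be 0101, the only binary vector that resolves both 0N01 and 01N1. The function names are hypothetical.

```python
def resolves(vector, fingerprint):
    """True if the binary vector matches the fingerprint outside its 'N' positions."""
    return all(c == "N" or c == v for c, v in zip(fingerprint, vector))


def co_clustered_pairs(fingerprints, order):
    """Assign fingerprints to resolved vectors in the given order; count IEC pairs."""
    unassigned = list(fingerprints)
    pairs = 0
    for vector in order:
        cluster = [f for f in unassigned if resolves(vector, f)]
        unassigned = [f for f in unassigned if f not in cluster]
        pairs += len(cluster) * (len(cluster) - 1) // 2
    if unassigned:
        raise ValueError("some fingerprints were never assigned")
    return pairs


F = ["N001", "0N01", "01N1", "011N"]
r12, r23, r34 = "0001", "0101", "0111"

print(co_clustered_pairs(F, [r23, r12, r34]))  # 1: greedy with the unlucky first choice
print(co_clustered_pairs(F, [r12, r34]))       # 2: the optimal solution
```

The ratio 1/2 between the two values matches the ½ approximation factor stated for IEC.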

A Polynomial Time Algorithm for Restricted CMV
Restricted CMV: for each position j there is at most one fingerprint having value N in the j-th position
An instance of restricted CMV: f1 = NN ; f2 = 01 NN 01 01; f3 = NN 01; f4 = NN
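The restriction is easy to test mechanically. In the sketch below (the function name is hypothetical) the inputs are the fingerprints from the tight IEC example and from the CMV example in this transcript, because the instance on this slide did not survive transcription intact.

```python
def is_restricted(fingerprints):
    """True if every position holds 'N' in at most one fingerprint."""
    length = len(fingerprints[0])
    return all(
        sum(f[j] == "N" for f in fingerprints) <= 1
        for j in range(length)
    )


print(is_restricted(["N001", "0N01", "01N1", "011N"]))  # True: each column has one N
print(is_restricted(["01NN", "0NN1", "0N00", "00N1"]))  # False: column 2 has two N's
```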

A Polynomial Time Algorithm for Restricted CMV
Two interesting properties of restricted CMV:
1. there are at most n^2 interesting resolved vectors (an interesting resolved vector resolves more than one fingerprint)
2. there is a fingerprint (a private fingerprint) that is resolved by one interesting resolved vector
At each step the algorithm selects the interesting resolved vector that resolves a private fingerprint

APX-hardness of CMV(2)
L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
G=(V, E) cubic graph → gadget graph G_A=(V_A, E_A)
For each v_i in V define the vertex gadget GV_i (shown in the slide figure)
Two possible vertex covers of the gadget:
type 1: suboptimal
type 2: optimal

APX-hardness of CMV(2)
G=(V, E) cubic graph → gadget graph G_A=(V_A, E_A)
For each edge (v_i, v_j) in E define the edge gadget EG_ij
1. Four vertices covered in EG_ij → GV_i and GV_j both optimal
2. Two vertices covered in EG_ij → GV_i or GV_j suboptimal
Case 2 is always better than case 1

APX-hardness of CMV(2)
An instance of CMV(2) is built as follows:
a resolved vector is built for each vertex of the gadgets
a fingerprint is built for each edge of the gadgets
two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex

APX-hardness of IEC(2) and OEC(2)
L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
Similar to the reduction for CMV(2)
G=(V,E) a cubic graph:
–for each vertex v_i in V, a set F_i of 9 fingerprints
–for each edge (v_i, v_j), a fingerprint f_ij

Open Problems
Approximation of CMV(p):
–constant factor not dependent on p?
–improve the min(1 + log n, 2 + p log l) approximation factor
Approximation of IEC(p) and OEC(p):
–improve approximation factors ½ and 2
Are restricted versions of IEC and OEC in P?

Conclusions
Biological problem and combinatorial problem
Three versions:
–Clustering with Missing Values (CMV)
–Inside Edge Clustering (IEC)
–Outside Edge Clustering (OEC)
Approximation algorithms for IEC(p) and OEC(p)
Polynomial time algorithm for restricted CMV
APX-hardness of CMV(2)
APX-hardness of IEC(2) and OEC(2)
Future work