
Fingerprint Clustering - CPM 2006. Fingerprint Clustering with Bounded Number of Missing Values. Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri.


1 Fingerprint Clustering with Bounded Number of Missing Values. Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri (Università di Milano-Bicocca, Italy); Riccardo Dondi (Università di Bergamo, Italy).

2 Talk Outline. Biological problem and combinatorial problem; three versions of the problem: Clustering with Missing Values (CMV), Inside Edge Clustering (IEC), Outside Edge Clustering (OEC); approximation algorithm for IEC and OEC; polynomial-time algorithm for restricted CMV; APX-hardness of CMV; APX-hardness of IEC and OEC; future work.

3 Biological Motivations. Classification of microorganisms: a library of rDNA clones (clones of ribosomal RNA genes) is created; a short DNA sequence (a probe) is applied to hybridize with all clones of the library; after hybridization, unbound probes are removed and the library is analyzed to see how strongly the probe has hybridized to each spot; the experiment is repeated for a set of probes.

4 Biological Motivations. Fingerprint of a clone: the vector of hybridization intensity values between the clone and each probe. To classify microorganisms, fingerprints are transformed into binary vectors and then clustered to infer different properties with respect to the probes.

5 Biological Motivations. Goal: translate hybridization intensity values into binary values 0 and 1. Because some intensity values are ambiguous, it is not always possible to obtain binary vectors, so for each clone we are given a fingerprint over the alphabet {0, 1, N}: 0 → no hybridization; 1 → hybridization; N → unable to determine whether a hybridization has happened.

6 Clustering of fingerprints – Combinatorial problem. Two fingerprints are compatible iff they agree in each position where both differ from N. Example: two compatible fingerprints: 010NN010 and 01NN1010; two incompatible fingerprints: 010NN010 and 01NN1000 (they disagree in position 7).
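The compatibility test can be sketched in a few lines of Python. This is only an illustration: the function name `compatible` and the encoding of fingerprints as strings over 0, 1, N are assumptions, not part of the talk.

```python
def compatible(f, g):
    """Return True iff f and g agree in every position where both differ from 'N'."""
    if len(f) != len(g):
        raise ValueError("fingerprints must have the same length")
    return all(a == b or a == "N" or b == "N" for a, b in zip(f, g))


# The two pairs from the slide, written without spaces.
print(compatible("010NN010", "01NN1010"))  # True
print(compatible("010NN010", "01NN1000"))  # False
```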

7 Clustering of fingerprints – Combinatorial problem. General formulation. Input: a set F of fingerprints. Output: a clustering (partition) C of F such that each cluster of C contains only pairwise compatible fingerprints.

8 Clustering of fingerprints – Combinatorial problem. An example: F = {f1 = 010N, f2 = 0N01, f3 = N100, f4 = 1NN1}. Compatible pairs: f1 and f2; f1 and f3. Some possible solutions: –(f1 = 010N, f2 = 0N01), (f3 = N100), (f4 = 1NN1) –(f1 = 010N, f3 = N100), (f2 = 0N01), (f4 = 1NN1)
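A feasibility check for a proposed clustering might look as follows. It is a minimal sketch under the assumption that fingerprints are stored in a dict from hypothetical names (f1, f2, ...) to strings; `is_valid_clustering` is an illustrative helper, not something defined in the talk.

```python
from itertools import combinations


def compatible(f, g):
    """True iff f and g agree wherever both differ from 'N'."""
    return all(a == b or a == "N" or b == "N" for a, b in zip(f, g))


def is_valid_clustering(fingerprints, clusters):
    """Check that clusters partition the fingerprints into pairwise compatible groups."""
    names = [name for cluster in clusters for name in cluster]
    if sorted(names) != sorted(fingerprints):
        return False  # not a partition of the input set
    return all(
        compatible(fingerprints[a], fingerprints[b])
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )


F = {"f1": "010N", "f2": "0N01", "f3": "N100", "f4": "1NN1"}
print(is_valid_clustering(F, [["f1", "f2"], ["f3"], ["f4"]]))  # True
print(is_valid_clustering(F, [["f1", "f3"], ["f2"], ["f4"]]))  # True
print(is_valid_clustering(F, [["f1", "f2", "f3"], ["f4"]]))    # False: f2 and f3 clash
```

Pairwise compatibility is enough here: if every two fingerprints in a cluster agree on their non-N positions, the whole cluster has a common resolution.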

9 Clustering of fingerprints – Three versions of the problem. Three combinatorial versions of the problem with different objective functions. CMV (Clustering with Missing Values): minimize the number of clusters. IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints. OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters.

10 CMV: an example. CMV: minimize the number of clusters. F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}. Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4. –A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 3 –Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2

11 IEC: an example. IEC: maximize the number of co-clustered pairs. F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}. Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4. –A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 1: pair (f1, f2) co-clustered –Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) co-clustered

12 OEC: an example. OEC: minimize the number of compatible pairs that are not co-clustered. F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}. Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4. –A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) not co-clustered –Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 1: pair (f1, f2) not co-clustered
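The three objective values of the running example can be recomputed with a short script. This is a sketch that assumes the clustering passed in is already valid (only compatible fingerprints share a cluster); the function name `objectives` and the dict encoding are illustrative choices.

```python
from itertools import combinations


def compatible(f, g):
    """True iff f and g agree wherever both differ from 'N'."""
    return all(a == b or a == "N" or b == "N" for a, b in zip(f, g))


def objectives(fingerprints, clusters):
    """Return (CMV, IEC, OEC) values of a valid clustering given as lists of names."""
    cluster_of = {name: i for i, cluster in enumerate(clusters) for name in cluster}
    compatible_pairs = [
        (a, b)
        for a, b in combinations(sorted(fingerprints), 2)
        if compatible(fingerprints[a], fingerprints[b])
    ]
    inside = sum(1 for a, b in compatible_pairs if cluster_of[a] == cluster_of[b])
    outside = len(compatible_pairs) - inside
    return len(clusters), inside, outside


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(objectives(F, [["f1", "f2"], ["f3"], ["f4"]]))  # (3, 1, 2)
print(objectives(F, [["f1", "f3"], ["f2", "f4"]]))    # (2, 2, 1)
```

For a fixed instance the IEC and OEC values always sum to the number of compatible pairs (3 here), so the two problems share optimal solutions even though their approximation ratios differ.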

13 Parameterized versions. We consider parameterized versions of the problem, where the parameter p is the number of N’s: CMV(p), IEC(p) and OEC(p) are the problems restricted to fingerprints with at most p positions of value N.

14 Parameterized versions. Resolution of a fingerprint f: a vector over {0,1} that is compatible with f. Example: f = 01NN10. Possible resolutions: 010010, 010110, 011010, 011110.

15 Parameterized versions. A fingerprint with p N’s has 2^p possible resolutions. Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S so as to optimize the objective function.
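Enumerating resolutions, and building the set S together with the fingerprints each resolved vector resolves, could be sketched as below. The helper names `resolutions` and `resolved_vector_map` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def resolutions(f):
    """All binary vectors compatible with fingerprint f (2**p of them for p N's)."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]


def resolved_vector_map(fingerprints):
    """Map each candidate resolved vector to the set of fingerprint names it resolves."""
    resolves = defaultdict(set)
    for name, f in fingerprints.items():
        for r in resolutions(f):
            resolves[r].add(name)
    return dict(resolves)


print(resolutions("01NN10"))  # ['010010', '010110', '011010', '011110']
```

With n fingerprints and at most p N's each, S has at most n * 2^p vectors, which stays polynomial in n whenever p = O(log n).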

16 Previous results. CMV(p): NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]; polynomial time for p = 1 [Figueroa et al., J. of Comp. Biology 2004]; approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]. IEC(p): approximation algorithm with factor 2^(2p−1) for any p = O(log n) [Figueroa et al., CATS 2005]. OEC(p): approximation algorithm with factor 2(1-1/2p) for restricted instances [Figueroa et al., CATS 2005].

17 Approximation algorithm for OEC(p) and IEC(p). Greedy Algorithm: WHILE (there exists an unassigned fingerprint) 1. select a resolved vector that resolves the maximum number of unassigned fingerprints and assign them to it; 2. delete the assigned fingerprints ENDWHILE. Approximation ratio 2 for OEC; approximation ratio ½ for IEC.
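The greedy loop can be sketched as follows. This is one possible reading of the pseudocode, not the authors' implementation: the tie-breaking rule (lexicographically smallest vector) is an arbitrary assumption, and the names `resolutions` and `greedy_clustering` are hypothetical.

```python
from itertools import product


def resolutions(f):
    """All binary vectors compatible with fingerprint f."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]


def greedy_clustering(fingerprints):
    """Greedy sketch: fingerprints is a dict name -> string over {0,1,N}.

    Returns a dict mapping each chosen resolved vector to the sorted list of
    fingerprint names assigned to it.
    """
    unassigned = dict(fingerprints)
    clusters = {}
    while unassigned:
        # For every candidate resolved vector, list the unassigned fingerprints it resolves.
        resolves = {}
        for name, f in unassigned.items():
            for r in resolutions(f):
                resolves.setdefault(r, []).append(name)
        # Pick a vector resolving the most fingerprints; ties go to the
        # lexicographically smallest vector (an assumption, the slide leaves ties open).
        best = min(resolves, key=lambda r: (-len(resolves[r]), r))
        clusters[best] = sorted(resolves[best])
        for name in resolves[best]:
            del unassigned[name]
    return clusters


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(greedy_clustering(F))
```

Each round enumerates up to 2^p resolutions per remaining fingerprint, so the sketch runs in polynomial time only when p = O(log n).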

18 A tight example for IEC. f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N. f1 is compatible with f2, f2 with f3, and f3 with f4. Resolved vectors associated with these compatibilities: r12 = 0001; r23 = 0101; r34 = 0111. Each of these resolved vectors resolves two fingerprints.

19 A tight example for IEC. The algorithm chooses one resolved vector, for example r23: f2 and f3 are assigned to r23 and deleted; then r12 is chosen, and f1 is assigned to it and deleted; finally r34 is chosen, and f4 is assigned to it and deleted. Number of compatible co-clustered pairs: 1. The optimal solution consists of r12, to which f1 and f2 are assigned, and r34, to which f3 and f4 are assigned. Number of compatible co-clustered pairs in the optimal solution: 2.
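The two assignments of the tight example can be checked mechanically. The sketch below takes r23 = 0101, the only binary vector that resolves both f2 = 0N01 and f3 = 01N1; the helper names `resolves` and `co_clustered_pairs` are illustrative assumptions.

```python
from collections import Counter


def resolves(r, f):
    """True iff binary vector r is a resolution of fingerprint f."""
    return len(r) == len(f) and all(c == "N" or c == b for b, c in zip(r, f))


def co_clustered_pairs(fingerprints, assignment):
    """Count pairs of fingerprints assigned to the same resolved vector."""
    for name, r in assignment.items():
        if not resolves(r, fingerprints[name]):
            raise ValueError(f"{r} does not resolve {name}")
    sizes = Counter(assignment.values())
    return sum(k * (k - 1) // 2 for k in sizes.values())


F = {"f1": "N001", "f2": "0N01", "f3": "01N1", "f4": "011N"}
# A possible greedy run: r23 first, then r12 and r34 each pick up one fingerprint.
greedy_run = {"f2": "0101", "f3": "0101", "f1": "0001", "f4": "0111"}
# The optimal solution uses only r12 and r34.
optimal = {"f1": "0001", "f2": "0001", "f3": "0111", "f4": "0111"}

print(co_clustered_pairs(F, greedy_run))  # 1
print(co_clustered_pairs(F, optimal))     # 2
```

The ratio 1/2 between the two counts matches the ½ approximation ratio stated for IEC, which is why the example is tight.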

20 A Polynomial Time Algorithm for Restricted CMV. Restricted CMV: for each position j there is at most one fingerprint having value N in the j-th position. An instance of restricted CMV: f1 = NN 01 01 01; f2 = 01 NN 01 01; f3 = 01 11 NN 01; f4 = 01 11 11 NN.
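Whether an instance satisfies the restriction can be tested column by column. A minimal sketch, assuming fingerprints are equal-length strings written without spaces; `is_restricted` is a hypothetical helper name.

```python
def is_restricted(fingerprints):
    """True iff every position holds 'N' in at most one fingerprint."""
    # zip(*fingerprints) yields one tuple per position (column).
    return all(column.count("N") <= 1 for column in zip(*fingerprints))


instance = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
print(is_restricted(instance))            # True
print(is_restricted(["01NN", "0NN1"]))    # False: both have N in position 3
```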

21 A Polynomial Time Algorithm for Restricted CMV. Two interesting properties of restricted CMV: 1. there are at most n^2 interesting resolved vectors (a resolved vector is interesting if it resolves more than one fingerprint); 2. there is a fingerprint (a private fingerprint) which is resolved by one interesting resolved vector. At each step the algorithm selects the interesting resolved vector that resolves a private fingerprint.

22 APX-hardness of CMV(2). L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti et al., TCS 2000]). G = (V, E) cubic graph → graph gadget G_A = (V_A, E_A). For each v_i in V define a vertex gadget GV_i [gadget figure not reproduced in the transcript]. Two possible vertex covers of the gadget: type 1: suboptimal; type 2: optimal.

23 APX-hardness of CMV(2). G = (V, E) cubic graph → graph gadget G_A = (V_A, E_A). For each edge (v_i, v_j) in E define the edge gadget EG_ij. 1. Four vertices covered in EG_ij → GV_i and GV_j both optimal. 2. Two vertices covered in EG_ij → GV_i or GV_j suboptimal. Case 2 is always better than case 1.

24 APX-hardness of CMV(2). An instance of CMV(2) is built as follows: a resolved vector is built for each vertex of the gadgets; a fingerprint is built for each edge of the gadgets; two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex.

25 APX-hardness of IEC(2) and OEC(2). L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti et al., TCS 2000]). Similar to the reduction for CMV(2). G = (V, E) a cubic graph: –for each vertex v_i in V, a set F_i of 9 fingerprints –for each edge (v_i, v_j), a fingerprint f_ij

26 Open Problems. Approximation of CMV(p): –is there a constant factor not dependent on p? –improve the min(1 + log n, 2 + p log l) approximation factor. Approximation of IEC(p) and OEC(p): –improve the approximation factors ½ and 2. Are restricted versions of IEC and OEC in P?

27 Conclusions. Biological problem and combinatorial problem; three versions: Clustering with Missing Values (CMV), Inside Edge Clustering (IEC), Outside Edge Clustering (OEC); approximation algorithms for IEC(p) and OEC(p); polynomial-time algorithm for restricted CMV; APX-hardness of CMV(2); APX-hardness of IEC(2) and OEC(2); future work.

