Learning a hidden graph with adaptive algorithms Hung-Lin Fu Department of Applied Mathematics National Chiao Tung University Hsin Chu, Taiwan
Motivated by bioinformatics applications Introduction
Random shotgun approach: a genomic segment is cut many times at random (shotgun).
Whole-genome shotgun sequencing: short reads are obtained that cover the (circular) genome with redundancy and possible gaps.
Reads are assembled into contigs with unknown relative placement.
Primers: (short) fragments of DNA characterizing the ends of contigs.
A PCR (polymerase chain reaction) reveals whether two primers are proximate (adjacent to the same gap). Multiplex PCR can treat multiple primers simultaneously: it outputs whether the input set contains a pair of adjacent primers, and sometimes even the number of such pairs.
Two primers of each contig are “mixed together”: find a Hamiltonian cycle by PCRs!
Primers are treated independently: find a perfect matching by PCRs.
Goal: to provide an experimental protocol that identifies all pairs of adjacent primers with as few PCRs (or multiplex PCRs, respectively) as possible. Each PCR is one query.
Mathematical Models. Hidden graphs (to be reconstructed): topology-known graphs (e.g. Hamiltonian cycle, matching, star, clique, bipartite graph, etc.); graphs of bounded degree; hypergraphs; graphs with a known number of edges.
Models for learning a hidden graph by edge-detecting queries: multi-vertex model; quantitative multi-vertex model; k-vertex model; quantitative k-multi-vertex model.
Part II: Algorithms. Adaptive algorithms: a query can depend on the answers obtained by previous queries. Nonadaptive algorithms: queries are independent and can be processed in parallel.
Example. G: a hidden graph on the vertices {1, 2, ..., 8} (figure).
Q({1,2,3,4,5,6,7,8}) = 1
Q({1,2,3,4}) = 0
Q({1,2,3,4,5,7}) = 1
Take v = 5 and S \ {v} = {1, 2, 3, 4}: Q({1,2,3,4,5}) = 1, Q({5,1,2}) = 0, Q({5,3}) = 1, so {3, 5} is an edge.
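The queries in this example can be simulated with a small sketch. The edge set below is hypothetical (the figure's edges are not recoverable from the text); it is chosen only so that every answer listed in the example is reproduced. The names make_oracle and find_neighbor are illustrative, not from the talk.

```python
def make_oracle(edges):
    """Return an edge-detecting query Q: Q(S) = 1 iff S contains both ends of some edge."""
    edge_set = {frozenset(e) for e in edges}

    def Q(S):
        S = set(S)
        return int(any(e <= S for e in edge_set))

    return Q


def find_neighbor(Q, v, S):
    """Binary search for one neighbor of v in S.

    Assumes S is an independent set and Q(S + [v]) == 1.
    """
    cand = sorted(S)
    while len(cand) > 1:
        half = cand[: len(cand) // 2]
        if Q(half + [v]):
            cand = half                      # a neighbor lies in the first half
        else:
            cand = cand[len(cand) // 2:]     # otherwise it lies in the second half
    return cand[0]


# Hypothetical hidden graph, consistent with the answers in the example.
Q = make_oracle([(3, 5), (5, 6), (7, 8)])
answers = [Q(range(1, 9)), Q([1, 2, 3, 4]), Q([1, 2, 3, 4, 5, 7]),
           Q([1, 2, 3, 4, 5]), Q([5, 1, 2]), Q([5, 3])]
neighbor = find_neighbor(Q, 5, [1, 2, 3, 4])
```

The binary search uses about lg |S| queries per neighbor, which is where the lg n factors in the bounds come from.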
Known Results (Matching). The information-theoretic lower bound for matching is (1+o(1)) n lg n, and this bound can be reached by an adaptive algorithm [Bouvel et al. 05]. Nonadaptive algorithms require queries [Alon, Beigel, Kasif, Rudich, Sudakov 02].
Strategy: first find one vertex. Theorem [Angluin 06]: a vertex in a hidden graph on n vertices can be reconstructed with at most queries.
Results: example of Find-One-Vertex.
Known Results on Other Graphs: Hamiltonian cycle (lower and upper bounds), star.
Hamiltonian cycle (adaptive): the O(n lg n) bound can be reached by an adaptive algorithm [Grebinski, Kucherov 1997]. Proof idea: process all vertices one by one, storing them in an independent set of chains. Case I: no/no; case II: yes/no; case III: yes/yes. At most 2n lg n queries.
How about more general graphs?
Lower bound. Theorem 3. For any , edge-detecting queries are required to identify a graph drawn from the class of all graphs with n vertices and m edges.
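A standard counting argument gives a lower bound of this kind: each edge-detecting query returns one bit, so telling apart all graphs with n vertices and m edges needs at least lg C(C(n,2), m) queries. The sketch below only evaluates that count; it is an illustration under this assumption and is not claimed to be the exact expression stated in Theorem 3. The function name is hypothetical.

```python
import math


def info_lower_bound(n, m):
    """Smallest q with 2**q >= number of graphs on n labeled vertices with m edges."""
    graphs = math.comb(math.comb(n, 2), m)
    # (graphs - 1).bit_length() equals ceil(log2(graphs)) for graphs >= 1,
    # computed exactly on integers (no floating-point rounding).
    return (graphs - 1).bit_length()


bound_8_4 = info_lower_bound(8, 4)  # C(28, 4) = 20475 candidate graphs
```

For sparse graphs the count grows roughly like m lg(n^2/m), which is the same order as m lg n when m is small compared with n^2.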
Main Ideas. If there are edges between two independent sets A and B, we may find all of them by running the (a, B)-algorithm for each a ∈ A. We start by finding a maximal matching! Algorithm 1: MAXIMAL_MATCHING(V). Algorithm 2: PARTITION_OF_VERTEX_SET(V). Algorithm 3: HIDDEN_GRAPH(V).
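One way to realize an (a, B)-style search is recursive halving: query a together with half of B and discard any half that answers 0. This is a minimal sketch assuming A and B are independent sets and a is not in B; the function names and the example edge set are hypothetical, and the talk's own (a, B)-algorithm may differ in detail.

```python
def make_oracle(edges):
    """Edge-detecting query: Q(S) = 1 iff S contains both ends of some edge."""
    edge_set = {frozenset(e) for e in edges}
    return lambda S: int(any(e <= set(S) for e in edge_set))


def neighbors_in(Q, a, B):
    """All neighbors of a inside the independent set B, by recursive halving."""
    B = list(B)
    if not B or not Q(B + [a]):
        return []                 # B is independent, so answer 0 means no neighbor of a in B
    if len(B) == 1:
        return B                  # the single remaining vertex is a neighbor
    mid = len(B) // 2
    return neighbors_in(Q, a, B[:mid]) + neighbors_in(Q, a, B[mid:])


def edges_between(Q, A, B):
    """All edges between independent sets A and B."""
    return [(a, b) for a in A for b in neighbors_in(Q, a, B)]


# Hypothetical bipartite example: A = {1,2,3,4}, B = {5,6,7,8}.
Q = make_oracle([(1, 5), (1, 6), (2, 6), (3, 7)])
found = edges_between(Q, [1, 2, 3, 4], [5, 6, 7, 8])
```

Each edge found costs on the order of lg |B| queries, and a vertex with no neighbor in B is dismissed with a single query.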
Reference
Reconstructing a Hamiltonian cycle by querying the graph: Application to DNA physical mapping [Grebinski and Kucherov, 98]
Learning a hidden matching [Alon et al., 04]
Learning a hidden graph using O(lg n) queries per edge [Angluin and Chen, 04]
Learning a hidden subgraph [Alon and Asodi, 05]
Combinatorial search on graphs motivated by bioinformatics applications: a brief survey [Bouvel, Grebinski and Kucherov, 05]
Learning a hidden hypergraph [Angluin and Chen, 06]
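Example (Algorithm A(V): finding an edge on V). MAXIMAL_MATCHING(V) on the hidden graph with vertices {1, 2, ..., 8} (figure): Algorithm A({1,2,3,4,5,6,7,8}) finds the edge {1, 3}; Algorithm A({2,4,5,6,7,8}) finds the edge {2, 4}; Algorithm A({5,6,7,8}) finds the edge {5, 7}; Q({8,6}) = 0, so the matching is maximal.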
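The matching loop can be sketched as follows. The edge-finding routine here is a stand-in based on two binary searches (shortest prefix containing an edge, then a neighbor inside that prefix); it is not claimed to be the talk's Algorithm A. The hidden edge set is hypothetical, chosen so that the run reproduces the edges {1, 3}, {2, 4}, {5, 7} and the final answer Q({6, 8}) = 0.

```python
def make_oracle(edges):
    """Edge-detecting query: Q(S) = 1 iff S contains both ends of some edge."""
    edge_set = {frozenset(e) for e in edges}
    return lambda S: int(any(e <= set(S) for e in edge_set))


def find_edge(Q, V):
    """Find one edge inside V, assuming Q(V) == 1 (stand-in for Algorithm A)."""
    V = list(V)
    lo, hi = 2, len(V)            # smallest k with Q(V[:k]) == 1 lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if Q(V[:mid]):
            hi = mid
        else:
            lo = mid + 1
    v, S = V[lo - 1], V[:lo - 1]  # S is independent and v has a neighbor in S
    while len(S) > 1:
        half = S[: len(S) // 2]
        S = half if Q(half + [v]) else S[len(S) // 2:]
    return (S[0], v)


def maximal_matching(Q, V):
    """Repeatedly remove both ends of a found edge until no edge remains."""
    V, matching = list(V), []
    while len(V) >= 2 and Q(V):
        u, v = find_edge(Q, V)
        matching.append((u, v))
        V = [x for x in V if x not in (u, v)]
    return matching, V            # V is now an independent set


# Hypothetical hidden graph on {1, ..., 8}.
Q = make_oracle([(1, 3), (2, 4), (5, 7), (3, 6), (7, 8), (4, 5)])
matching, rest = maximal_matching(Q, range(1, 9))
```

The leftover vertices form an independent set, which is what the partition step relies on.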
Algorithm 2: PARTITION_OF_VERTEX_SET(V). (Figure: the hidden graph G on vertices 1, ..., 8.)
Algorithm 3: It remains to find all the edges between the independent sets. Now a general graph is reconstructed.
Don’t Stop!
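Complexity. The number of queries is less than 2m(log n + 9). Query counts are tabulated per line of each algorithm:
Algorithm 1. Line | Number of queries: rows for lines 2 and 3, and the total.
Algorithm 2. Line | Number of queries: rows for lines 2 and 3, and the total.
Algorithm 3. Line | Number of queries: rows for lines 1, 7, 14+17, 15+18 and 26, and the total. Lines 14+17 need 0 queries (all of these queries are already answered in line 10 of Algorithm 2).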
Concluding remarks. Reduce the number of rounds of Algorithm 1 (i.e., obtain an efficient algorithm to find a maximal matching). Learn a hidden graph in the quantitative k-multi-vertex model.
References
[1] N. Alon, R. Beigel, S. Kasif, S. Rudich, and B. Sudakov. Learning a hidden matching. The 43rd Annual IEEE Symposium on Foundations of Computer Science, 197–206, 2002.
[2] D. Angluin and J. Chen. Learning a hidden graph using O(log n) queries per edge. Manuscript, 2006.
[3] D. Angluin and J. Chen. Learning a hidden hypergraph. Journal of Machine Learning Research 7, 2215–2236, 2007.
[4] R. Beigel, N. Alon, S. Kasif, M. S. Apaydin, and L. Fortnow. An optimal procedure for gap closing in whole genome shotgun sequencing. In RECOMB, 22–30, 2001.
[5] V. Grebinski and G. Kucherov. Optimal query bounds for reconstructing a Hamiltonian cycle in complete graphs. In Fifth Israel Symposium on the Theory of Computing Systems, 166–173, 1997.
[6] V. Grebinski and G. Kucherov. Reconstructing a Hamiltonian cycle by querying the graph: Application to DNA physical mapping. Discrete Applied Math., 88(1-3): 147–165, 1998.
Thank you for your attention!