The computation of hitting sets: Review and new algorithms

The computation of hitting sets: Review and new algorithms. Li Lin, Yunfei Jiang. Department of Mathematics, Jinan University, Guangzhou, PR China; Institute of Computer Software, Sun Yat-sen University, Guangzhou, PR China. From Information Processing Letters 86 (2003). Presenter: 張家榮

Overview: Introduction, Backgrounds, BHS-tree, Boolean algorithm, Empirical results, Some ideas

Minimal hitting set. First used by R. Reiter in 1987 in “A theory of diagnosis from first principles” (Artificial Intelligence). It can be used to solve the minimum set cover problem, the diagnosis problem, and the teachers and courses problem.

Minimal hitting set. Given a collection C = {Si | i ∈ N} of sets of elements from some universe U, a hitting set is a set S ⊆ U such that S ∩ Si ≠ ∅ for all i. “Minimal” means that no element of S can be deleted without S ceasing to be a hitting set. For example, let C = {{M1, M2, A1}, {M1, A1, A2, M3}}. The minimal hitting sets are {M1}, {A1}, {M2, A2}, and {M2, M3}.
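The definition can be checked directly on the example. The following is a minimal brute-force sketch, not one of the algorithms reviewed in the talk; the function names `hits_all` and `minimal_hitting_sets` are made up for illustration.

```python
from itertools import combinations

def hits_all(candidate, collection):
    """True if candidate shares at least one element with every set."""
    return all(candidate & s for s in collection)

def minimal_hitting_sets(collection):
    """Enumerate all minimal hitting sets by increasing size (brute force)."""
    universe = sorted(set().union(*collection))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            # A superset of a hitting set found earlier cannot be minimal.
            if any(h <= candidate for h in found):
                continue
            if hits_all(candidate, collection):
                found.append(candidate)
    return found

C = [{"M1", "M2", "A1"}, {"M1", "A1", "A2", "M3"}]
result = minimal_hitting_sets(C)
print(sorted(sorted(h) for h in result))
# [['A1'], ['A2', 'M2'], ['M1'], ['M2', 'M3']]
```

The enumeration is exponential in the size of the universe, which is why the tree-based and Boolean algorithms of the talk matter in practice.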

Minimum set cover problem. Definition: a set cover is a collection of sets whose union contains every member of the union of all the given sets. The set cover problem is to find a cover of minimum size. Formal definition: given a set S of sets, choose c ⊆ S of minimum size such that ∪c = ∪S.

Reduced to MHS. [Figure: an example with sets a, b, c, d, e over the elements 1, 2, 3, 4, 5.]
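One common way to state the reduction: for each element, collect the names of the sets that contain it; a choice of names is a cover exactly when it hits every one of these collections. The sketch below uses a hypothetical instance (the instance drawn on the slide is not recoverable from the transcript), and all names in it are illustrative.

```python
from itertools import combinations

def set_cover_to_hitting_set(named_sets):
    """For each element u, return the names of the sets that contain u."""
    universe = set().union(*named_sets.values())
    return [frozenset(n for n, s in named_sets.items() if u in s)
            for u in sorted(universe)]

# Hypothetical instance, not the one shown on the slide.
sets = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5}}
conflicts = set_cover_to_hitting_set(sets)

def is_cover(names):
    return set().union(*(sets[n] for n in names)) == {1, 2, 3, 4, 5}

def is_hitting(names):
    return all(set(names) & c for c in conflicts)

# The two notions agree on every subset of set names.
names = sorted(sets)
agree = all(is_cover(combo) == is_hitting(combo)
            for r in range(len(names) + 1)
            for combo in combinations(names, r))
print(agree)  # True
```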

Diagnosis problem. Some components of a system may cease to operate as designed and cause a discrepancy between the expected behavior of the system and its observed behavior.

Diagnosis problem (cont.). Diagnosis proceeds in two steps: (1) compute the collection of all minimal conflict sets; (2) transform the conflict sets into diagnoses. A minimal conflict set is a minimal set of components such that the assumption that each of these components is behaving correctly is inconsistent with the system description and the observation.

How? (1) Conflict sets: outside the scope of this talk. (2) Diagnoses: computed as minimal hitting sets of the conflict sets.

Backgrounds (R. Reiter, 1987). Hitting sets can be computed by HS-trees. Problem: the size of the tree grows exponentially. Minimal hitting sets can be obtained efficiently by a pruned HS-tree. Problem: if the construction of the tree starts with an unfavorable set, some minimal hitting sets may be deleted by pruning.

Backgrounds (cont.). Greiner (1989) revised the HS-tree into an HS-DAG, in which the minimal hitting sets are not deleted. Separately, Reggia (1983) found the relationship between the set-covering problem and hitting sets. Haenni (1997) found that the inversions of the hypergraph are the minimal hitting sets.

Backgrounds (cont.). Vinterbo (2000) presented approximate hitting sets and also used genetic algorithms. Wotawa (2001) presented the HST-tree algorithm. The algorithm can be implemented in a straightforward way, so an efficient implementation is possible.

Notations. The minimal set cluster is MCS = {C1, C2, …, Cn}. Each node is a tuple (C, H), where C and H are set clusters. The root node is (MCS, {}). The left and right children of a node are denoted by (Cl, Hl) and (Cr, Hr), respectively. The function μ deletes non-minimal conflict/hitting sets.
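As an illustration of what μ does, here is a minimal sketch under the assumption that a set is non-minimal exactly when it strictly contains another set of the same cluster. The name `mu` and the frozenset representation are choices made for this example.

```python
def mu(cluster):
    """Drop every set that strictly contains another set of the cluster."""
    sets = {frozenset(s) for s in cluster}
    return {s for s in sets if not any(t < s for t in sets)}

minimized = mu([{1, 2}, {1}, {2, 3}, {1, 2, 3}])
print(minimized == {frozenset({1}), frozenset({2, 3})})  # True
```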

BHS-tree. The tree is defined recursively as follows. If C = {}, then the BHS-tree is empty. Otherwise, select any element a ∈ ∪Ci; the children are (Cl = {Ci - {a} | a ∈ Ci}, Hl = {a}) and (Cr = {Ci | a ∉ Ci}, Hr = {}).

Example

Algorithm.
Step 1. If a node is a leaf node, then the MHS of this node is H; otherwise run Steps 2 and 3 recursively.
Step 2. Replace the H of every parent node with {H, {ml ∪ mr | ml ∈ Hl, mr ∈ Hr}}.
Step 3. Minimize H at the root node with the function μ until it comprises all minimal hitting sets.
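The split on an element a can be sketched as a plain recursion. This is one reading of the definition and steps above, not the authors' implementation: it builds no explicit tree, it applies μ at every level rather than only at the root, and the names `mu` and `bhs_hitting_sets` are hypothetical.

```python
def mu(cluster):
    """Drop every set that strictly contains another set of the cluster."""
    sets = {frozenset(s) for s in cluster}
    return {s for s in sets if not any(t < s for t in sets)}

def bhs_hitting_sets(cluster):
    """Minimal hitting sets via a binary split on one element a."""
    cluster = mu(cluster)
    if not cluster:
        return {frozenset()}          # nothing left to hit
    if frozenset() in cluster:
        return set()                  # an empty set can never be hit
    a = min(set().union(*cluster))    # any element works; min is deterministic
    c_left = {c - {a} for c in cluster if a in c}
    c_right = {c for c in cluster if a not in c}
    # Left side: either a itself, or a hitting set of the sets with a removed.
    h_left = {frozenset({a})} | bhs_hitting_sets(c_left)
    h_right = bhs_hitting_sets(c_right)
    # Combine as in Step 2 (ml ∪ mr), then minimize with mu.
    return mu({ml | mr for ml in h_left for mr in h_right})

C = [{"M1", "M2", "A1"}, {"M1", "A1", "A2", "M3"}]
mhs = bhs_hitting_sets(C)
print(sorted(sorted(h) for h in mhs))
# [['A1'], ['A2', 'M2'], ['M1'], ['M2', 'M3']]
```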

Online? When a new measurement adds a conflict set, it is not necessary to process the old conflict sets again; it is only necessary to add a new branch to the BHS-tree.

Add conflict set <5, 6>
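The idea of updating without recomputation can be illustrated without a tree. The sketch below is a standard incremental update on the family of minimal hitting sets, swapped in for the BHS-tree branch shown on the slide; the starting family is a hypothetical example, and `add_conflict_set` is an illustrative name.

```python
def add_conflict_set(hitting_sets, new_conflict):
    """Update a family of minimal hitting sets after one more conflict set."""
    new_conflict = frozenset(new_conflict)
    candidates = set()
    for h in hitting_sets:
        h = frozenset(h)
        if h & new_conflict:
            candidates.add(h)         # h already hits the new conflict set
        else:
            # Extend h by each element of the new conflict set in turn.
            candidates.update(h | {e} for e in new_conflict)
    # Keep only the minimal candidates.
    return {s for s in candidates if not any(t < s for t in candidates)}

# Hypothetical starting point: minimal hitting sets of {1, 2} and {2, 3}.
old = {frozenset({2}), frozenset({1, 3})}
updated = add_conflict_set(old, {5, 6})
print(sorted(sorted(h) for h in updated))
# [[1, 3, 5], [1, 3, 6], [2, 5], [2, 6]]
```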

Boolean algorithm. The conflict sets CS are presented as CNFs where each atom is negative. Example: suppose CS = {C1, C2, …, Cm} is a conflict set cluster, where Ci = {ei1, ei2, …, eini}. The conflict-set Boolean formula is CSF = ~e11 ~e12 … ~e1n1 + ~e21 ~e22 … ~e2n2 + … + ~em1 ~em2 … ~emnm.

Boolean algorithm (cont.). A hitting set H is presented as a disjunction of its elements. Example: for H = {h1, h2, …, hn}, the hitting-set Boolean formula is HF = h1 h2 … hn. Why? Theorem 1: if H is a HS of CS, then CSF · HF = 0.
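Theorem 1 can be checked by truth table on a small instance. The sketch below assumes one particular reading of the slide's notation, namely that juxtaposition and "·" mean AND and "+" means OR; the helper names are made up for this example.

```python
from itertools import product

def csf(conflicts, value):
    """OR over conflict sets of the AND of their negated atoms (assumed reading)."""
    return any(all(not value[e] for e in c) for c in conflicts)

def hf(hitting_set, value):
    """AND of the atoms of one candidate hitting set (assumed reading)."""
    return all(value[h] for h in hitting_set)

def product_is_zero(conflicts, hitting_set):
    """True if CSF AND HF is false under every truth assignment."""
    atoms = sorted(set().union(*conflicts))
    for bits in product([False, True], repeat=len(atoms)):
        value = dict(zip(atoms, bits))
        if csf(conflicts, value) and hf(hitting_set, value):
            return False
    return True

CS = [{"M1", "M2", "A1"}, {"M1", "A1", "A2", "M3"}]
print(product_is_zero(CS, {"M2", "A2"}))  # True: {M2, A2} is a hitting set
print(product_is_zero(CS, {"M2"}))        # False: {M2} misses the second set
```

Under this reading every product in CSF contains some ~h with h in H, so each term of CSF · HF contains h · ~h and vanishes.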

Definition. Let C be a Boolean formula. The function H(C) is defined recursively:
1) H(0) = 1, H(1) = 0
2) H(~e) = e
3) H(~e · C) = e + H(C)
4) H(~e + C) = e · H(C)
5) otherwise, H(C) = e · H(C1) + H(C2), where C1 ⊆ C and ~e ∉ C1, and C2 = {c | c ∪ {~e} ∈ C} ∪ C1
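A set-based sketch of this recursion follows. It is an illustration under stated assumptions rather than the authors' code: the formula is stored as a set of frozensets (one per product of negated atoms), only rules 1 and 5 are implemented because rules 2 to 4 appear to be special cases of rule 5 in this representation, and Boolean absorption is applied at every level. The names `absorb` and `H` follow the slide where possible.

```python
def absorb(terms):
    """Boolean absorption: drop any product that contains another product."""
    terms = {frozenset(t) for t in terms}
    return {t for t in terms if not any(u < t for u in terms)}

def H(csf):
    """csf: iterable of sets, each listing the atoms negated in one product."""
    csf = absorb(csf)
    if not csf:
        return {frozenset()}      # H(0) = 1
    if frozenset() in csf:
        return set()              # H(1) = 0
    e = min(set().union(*csf))    # rule 5 allows any atom; min is deterministic
    c1 = {c for c in csf if e not in c}               # products without ~e
    c2 = {c - {e} for c in csf if e in c} | c1        # ~e removed, plus C1
    with_e = {h | {e} for h in H(c1)}                 # e · H(C1)
    return absorb(with_e | H(c2))                     # ... + H(C2)

CS = [{"M1", "M2", "A1"}, {"M1", "A1", "A2", "M3"}]
hitting = H(CS)
print(sorted(sorted(h) for h in hitting))
# [['A1'], ['A2', 'M2'], ['M1'], ['M2', 'M3']]
```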

Theorem 2. Suppose CSF is a Boolean formula of CS; then H(CSF) is a Boolean formula of the HS of CS. The proof is by mathematical induction on the size of CS, k = |CS|.

Proof of Theorem 2. Cases 1) to 4) are straightforward. Suppose that case 5) has been proved for k ≤ n. Now take k = n + 1. For an arbitrary element e ∈ ∪S∈CS S, H(CSF) = e · H(CSF1) + H(CSF2), where CSF1 and CSF2 are as defined.

Proof of Theorem 2 (cont.). By assumption, H(CSF1) is a HS of CS1. CS1 ⊆ CS and ~e ∉ CS1, so e · H(CSF1) must be a HS of CS. Obviously, H(CSF2) is a HS of CS. So H(CSF) must be a HS of CS. Minimality can be proved by the Boolean absorption properties.

Empirical results. Implemented in ANSI C on UNIX (SGI 2200 Origin, CPU 4400 MHz, MIPS R12000 (IP27) processors, main memory 2 GB, OS IRIX 64 Release 6.5). [Figure legend: HS-tree (□), BHS-tree (+), Boolean algebraic algorithm (O).]

Empirical results (cont.). The Boolean algebraic algorithm needs less memory and running time. It is not sensitive to the order in which elements are selected. Only |CS| and the size of ∪C∈CS C influence its efficiency.

Difference between the BHS-tree and the Boolean algebraic algorithm. (1) Data structure: the BHS-tree uses binary tree structures; the Boolean algebraic algorithm uses list structures. (2) Steps: the BHS-tree needs two steps, first constructing the binary tree and then using it to compute the hitting sets recursively; the Boolean algebraic algorithm needs one step.

Some ideas

Approximation Algorithms for the Selection of Robust Tag SNPs. Yao-Ting Huang, Kun-Mao Chao, Kui Zhang, Ting Chen. Dept. Computer Science & Information Engineering, National Taiwan University; Dept. Biostatistics, University of Alabama at Birmingham, USA; Dept. Biological Sciences, University of Southern California, USA. 2019/2/24
Speaker notes: This talk is about how to handle SNP genotyping with missing data. My name is Yao-Ting Huang and my advisor is Kun-Mao Chao. We have two coauthors who are not here today: Prof. Zhang and Prof. Chen.

Transformation. Each SNP can distinguish some of the pairs of patterns. S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4). S2 can distinguish (P1, P4), (P2, P4), and (P3, P4). [Figure: bipartite graph with the SNPs S1, S2, S3, S4 on top and the pairs of patterns (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) at the bottom.]
Speaker notes: To solve this problem, we first take a closer look at the function of each SNP. If we pick SNP 1, we can be sure that we can distinguish patterns 1 and 3, because they have different colors at this SNP locus. We formulate this relation as a bipartite graph. SNP 1 can also distinguish patterns 1 and 4, and so on.

Observation 1: Tag SNPs. A set of SNPs forms a set of tag SNPs iff each pair of patterns is covered by at least one edge from those SNPs. For example, S1 and S3 can form a set of tag SNPs, whereas S1 and S2 cannot. [Figure: bipartite graph of S1, S2, S3 over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]
Speaker notes: One unanswered question is what kind of SNPs can be tag SNPs. We can easily answer this question by checking whether the bottom nodes in the graph are all covered by edges from them. For example, SNPs 1 and 3 are tag SNPs, and SNPs 1 and 2 are not, because the pair of patterns 1 and 2 is not covered, so we cannot distinguish patterns 1 and 2.
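The covering condition is easy to test mechanically. In the sketch below, the edges of S1 and S2 are the ones listed on the Transformation slide; the edges of S3 and S4 are an assumption, read off the CSF given on a later slide. The names `DISTINGUISHES`, `is_tag_set`, and `uncovered_pairs` are illustrative.

```python
# Pairs of patterns each SNP can distinguish.
# S1, S2: from the Transformation slide. S3, S4: assumed, derived from the CSF.
DISTINGUISHES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 3), (2, 4), (3, 4)},
    "S4": {(1, 2), (1, 4), (2, 3), (3, 4)},
}
ALL_PAIRS = {(i, j) for i in range(1, 5) for j in range(i + 1, 5)}

def uncovered_pairs(snps):
    """Pairs of patterns that no chosen SNP can distinguish."""
    covered = set().union(*(DISTINGUISHES[s] for s in snps))
    return ALL_PAIRS - covered

def is_tag_set(snps):
    return not uncovered_pairs(snps)

print(is_tag_set({"S1", "S3"}))        # True
print(is_tag_set({"S1", "S2"}))        # False
print(uncovered_pairs({"S1", "S2"}))   # {(1, 2)}
```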

Observation. This is the same as the minimum set cover problem. The minimum set cover problem can be transformed into the minimal hitting set problem.

Observation (cont.).
CSF = ~S3 ~S4 + ~S1 ~S3 + ~S1 ~S2 ~S4 + ~S1 ~S4 + ~S1 ~S2 ~S3 + ~S2 ~S3 ~S4 = ~S3 ~S4 + ~S1 ~S3 + ~S1 ~S4
H(CSF) = S3 · H(~S1 ~S4) + H(~S4 + ~S1 + ~S1 ~S4) = S3 · S1 + S3 · S4 + S4 · S1
So S1 and S3 can form a set of tag SNPs, and so can S3 and S4, and S1 and S4.

Observation 2: Missing Data. If a SNP is genotyped as missing data, the effect is the same as removing its node and edges. Suppose S4 is genotyped as missing data. [Figure: patterns P1 to P4 and the bipartite graph of S1, S2, S3, S4 over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), shown with and without S4.]
Speaker notes: Another important question is: what is the effect of missing data? It is easy to tell from this graph, because it is just like removing the node and its edges from the graph.

Problem Reformulation. To tolerate m missing tag SNPs, we need to find a set of SNPs such that each pair of patterns is covered by at least (m+1) edges. For example, we wish to find a set of robust tag SNPs that tolerates 1 missing tag SNP; then each pair of patterns must be covered by at least two edges. [Figure: bipartite graph of S1, S3, S4 over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]
Speaker notes: From the above two observations, we claim that if we want to tolerate m missing data, we have to guarantee that each bottom node is covered by at least m plus 1 edges. For example, if we want to tolerate one missing data, SNPs 1, 3, and 4 can be robust tag SNPs, because each node is covered by at least two edges.

Modification of Boolean algorithm.
H_1(0) = 1, H_1(1) = 0, H_0(C) = 1
H_1(~e) = e
H_i(~e · C) = e · H_{i-1}(C) + H_i(C)
H_i(C) = 0 if there exists a CS ∈ C that has fewer than i elements
H_i(~CS + C) = CS · H_i(C) if CS has exactly i elements
H_i(C) = e · H_{i-1}(C1) + H_i(C2), where C1 ⊆ C and ~e ∉ C1, and C2 = {c | c ∪ {~e} ∈ C} ∪ C1

Modification of Boolean algorithm (cont.). To tolerate m missing tag SNPs, we can use H_{m+1}(CSF). Example: H_2(CSF) = S3 S4 · H_2(~S1 ~S3 + ~S1 ~S4) = S1 S3 S4 · H_2(~S1 ~S4) = S1 S3 S4
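The result of the example can be cross-checked by brute force. The sketch below does not implement the modified H rules; it simply enumerates SNP subsets and keeps the minimal ones in which every pair of patterns is covered at least m+1 times. The edges of S3 and S4 are an assumption read off the CSF, and the function names are illustrative.

```python
from itertools import combinations

# S1, S2: from the Transformation slide. S3, S4: assumed, derived from the CSF.
DISTINGUISHES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 3), (2, 4), (3, 4)},
    "S4": {(1, 2), (1, 4), (2, 3), (3, 4)},
}
ALL_PAIRS = {(i, j) for i in range(1, 5) for j in range(i + 1, 5)}

def tolerates(snps, m):
    """True if every pair is covered by at least m + 1 of the chosen SNPs."""
    return all(sum(pair in DISTINGUISHES[s] for s in snps) >= m + 1
               for pair in ALL_PAIRS)

def minimal_robust_sets(m):
    """Brute-force enumeration of minimal SNP sets tolerating m missing SNPs."""
    names = sorted(DISTINGUISHES)
    found = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            candidate = frozenset(combo)
            if any(f <= candidate for f in found):
                continue              # superset of a smaller solution
            if tolerates(candidate, m):
                found.append(candidate)
    return found

robust = minimal_robust_sets(1)
print([sorted(s) for s in robust])  # [['S1', 'S3', 'S4']]
```

With m = 0 the same function returns the three ordinary tag SNP sets {S1, S3}, {S1, S4}, and {S3, S4}.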