An Index of Data Size to Extract Decomposable Structures in LAD
Hirotaka Ono, Mutsunori Yagiura, Toshihide Ibaraki (Kyoto University)
Overview
1. Overview of LAD
2. Decomposability
- Importance & motivation
3. An index of decomposability
- Number of data vectors needed to extract reliable decomposable structures
- Based on probabilistic analyses
4. Numerical experiments
5. Conclusion
Logical Analysis of Data (LAD)
For a phenomenon:
Input:
- T: positive examples (the phenomenon occurs)
- F: negative examples (the phenomenon does not occur)
Output: discriminant function
- f(x): a logical explanation of the phenomenon
Example: influenza
Attributes (1 = Yes, 0 = No): Fever, Headache, Cough, Snivel, Stomachache
T: set of patients having influenza
F: set of patients having common cold
A discriminant function f(x) for (T, F) represents knowledge about “influenza”.
Finding f is one form of knowledge acquisition.
Guideline to find a discriminant function
- Simplicity
- Explain the structure of the phenomenon
[Figure: positive and negative examples in the {0,1}^n space]
We focus on decomposability.
Decomposability
Data (T, F) over attributes x_1, x_2, x_3, x_4, x_5, with S_0 = {1, 4, 5} and S_1 = {2, 3}
$h(x[S_1]) = x_2 \lor x_3$
$f(x) = x_1 x_2 x_4 \lor x_1 x_3 x_4 = x_1 x_4 \, h(x[S_1])$: decomposable!
f is decomposable $\Leftrightarrow$ $f(x) = g(x[S_0], h(x[S_1]))$
(T, F) is decomposable $\Leftrightarrow$ there exists a decomposable discriminant function f
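The definition f(x) = g(x[S_0], h(x[S_1])) can be checked mechanically for a fully specified Boolean function. The sketch below is one way to do it and is not taken from the slides: it relies on the standard fact that such g and h exist exactly when fixing x[S_1] in all possible ways yields at most two distinct functions of x[S_0]. The helper names (`is_decomposable`, `f_example`, `f_other`) are hypothetical, and `f_example` assumes the two terms of the example f are joined by a disjunction.

```python
from itertools import product


def is_decomposable(f, n, s0, s1):
    """Return True if f(x) = g(x[S0], h(x[S1])) for some Boolean g and h.

    f takes a list of n bits; s0 and s1 hold 1-based attribute indices
    that together partition {1, ..., n}.
    """
    columns = set()
    for y in product((0, 1), repeat=len(s1)):      # fix the S1 part
        column = []
        for z in product((0, 1), repeat=len(s0)):  # enumerate the S0 part
            x = [0] * n
            for i, v in zip(s0, z):
                x[i - 1] = v
            for i, v in zip(s1, y):
                x[i - 1] = v
            column.append(f(x))
        columns.add(tuple(column))
    # h can take only two values, so at most two distinct columns may occur.
    return len(columns) <= 2


def f_example(x):
    # Assumed reading of the slide's example: x1 x2 x4 OR x1 x3 x4.
    x1, x2, x3, x4, x5 = x
    return (x1 & x2 & x4) | (x1 & x3 & x4)


def f_other(x):
    # A function that is not decomposable for S0 = {1, 4, 5}, S1 = {2, 3}.
    x1, x2, x3, x4, x5 = x
    return (x1 & x2) | (x3 & x4)
```

With S_0 = {1, 4, 5} and S_1 = {2, 3}, `f_example` yields only the columns "constant 0" and "x1 x4", so it is decomposable, while `f_other` yields four distinct columns.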
Another example: concept of “square”
Square = f(x_1, x_2, x_3)
- x_1: contains a right angle
- x_2: the lengths of all edges are equal
- x_3: the number of vertices is 4
Square = g(x_1, h(x_2, x_3))
- h(x_2, x_3): rhombus
The number of data and decomposable structures
Case 1: The size of the given data is small.
– Advantage: Less computational time is needed to find a decomposable structure.
– Disadvantage: Decomposable structures easily exist in the data (because there are fewer constraints), so most decomposable structures are deceptive.
The number of data and decomposable structures
Case 2: The size of the given data is large.
– Advantage: Deceptive decomposable structures will not be found.
– Disadvantage: More computational time is needed.
How many data vectors should be prepared to extract real decomposable structures?
This is what the index of decomposability answers.
Overview of our approach
(T, F) is decomposable $\Leftrightarrow$ conflict graph of (T, F) is bipartite (Boros et al. 1994)
Assume that (T, F) is a set of l vectors chosen randomly from {0, 1}^n.
1. Compute the probability of an edge to appear in the conflict graph
2. Regard the conflict graph as a random graph
- Investigate the probability of the conflict graph to be non-bipartite
Conflict graph
(T, F) is decomposable $\Leftrightarrow$ conflict graph of (T, F) is bipartite
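The bipartiteness test can be sketched in a few lines. The construction used here is an assumption, since the slide's own definition is not reproduced: vertices are the projections x[S_1] of the data vectors, and an edge joins a[S_1] and b[S_1] whenever some a in T and b in F agree on S_0 (so h must give them different values). The function names are hypothetical, and indices are 1-based as on the slides.

```python
from collections import deque


def project(x, idx):
    """Restrict vector x to the 1-based attribute indices in idx."""
    return tuple(x[i - 1] for i in idx)


def conflict_graph(T, F, s0, s1):
    """Build the (assumed) conflict graph of (T, F) as an adjacency dict."""
    adj = {}
    for x in list(T) + list(F):
        adj.setdefault(project(x, s1), set())
    for a in T:
        for b in F:
            if project(a, s0) == project(b, s0):
                u, v = project(a, s1), project(b, s1)
                adj[u].add(v)
                adj[v].add(u)
    return adj


def is_bipartite(adj):
    """Two-color the graph by breadth-first search; fail on an odd cycle."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True
```

A valid 2-coloring of the vertices gives the values of h directly, which is why a bipartite conflict graph corresponds to a decomposable (T, F) for the chosen (S_0, S_1).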
Probability of an edge to appear in conflict graph There exists a linked pair. A pair of vectors is called linked if
Define a random variable by where edge appears in the conflict graph. We want to compute. There exists a linked pair.
How to compute
is easier to compute.
1. Both of
2. They have different values (i.e., 0 and 1)
Notation: $L = |T| + |F|$, $p : q = |T| : |F|$, $M = 2^n$, $m = 2^{|S_0|}$
Approximation of
By the inclusion-exclusion principle,
Random graph
In our analysis, the edge probability r of the random graph is taken to be the probability of an edge appearing in the conflict graph.
Random graph G(N, r)
- N: the number of vertices
- Each edge e = (u, v) appears in G(N, r) with probability r independently
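The random graph model G(N, r) is easy to simulate, which gives an empirical feel for how often such a graph fails to be bipartite. The following is a minimal sketch, not the authors' code; the function names, the trial count and the seed handling are all illustrative choices.

```python
import random
from collections import deque


def sample_random_graph(n_vertices, r, rng):
    """Draw G(N, r): each of the N(N-1)/2 edges appears independently with probability r."""
    adj = {v: set() for v in range(n_vertices)}
    for u in range(n_vertices):
        for v in range(u + 1, n_vertices):
            if rng.random() < r:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def is_bipartite(adj):
    """Two-color the graph by breadth-first search; fail on an odd cycle."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def non_bipartite_fraction(n_vertices, r, trials=200, seed=0):
    """Estimate Pr(G(N, r) is not bipartite) from independent samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if not is_bipartite(sample_random_graph(n_vertices, r, rng)):
            hits += 1
    return hits / trials
```

Calling `non_bipartite_fraction` for a fixed N and a range of r values shows the estimated fraction moving from 0 toward 1 as r grows.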
Probability of a random graph to be non-bipartite
- $Y_{odd}$: random variable representing the number of odd cycles in G(N, r)
- $\Pr(Y_{odd} \ge 1)$: probability that G(N, r) is not bipartite
Markov's inequality: $\Pr(Y_{odd} \ge 1) \le E[Y_{odd}]$
The number of sequences of k vertices: $N(N-1)\cdots(N-k+1)$
For sufficiently large N,
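The first-moment argument can be made concrete with a short computation. This sketch assumes the standard counting: a cycle of length k corresponds to 2k of the N(N-1)...(N-k+1) vertex sequences and is present with probability r^k, so E[Y_odd] is the sum of N(N-1)...(N-k+1) r^k / (2k) over odd k >= 3. Replacing the falling factorial by N^k gives an upper bound that has a closed form when Nr < 1. The function names are hypothetical and the slides may state the bound differently.

```python
from math import atanh


def expected_odd_cycles(n_vertices, r):
    """Exact E[Y_odd] for G(N, r) under the counting described above."""
    total = 0.0
    weight = 1.0  # holds N(N-1)...(N-k+1) * r^k after step k
    for k in range(1, n_vertices + 1):
        weight *= (n_vertices - k + 1) * r
        if k >= 3 and k % 2 == 1:
            total += weight / (2 * k)
    return total


def first_moment_bound(n_vertices, r):
    """Upper bound sum_{k odd >= 3} (N r)^k / (2k), valid only when N r < 1."""
    c = n_vertices * r
    if not 0.0 <= c < 1.0:
        raise ValueError("the closed form requires 0 <= N * r < 1")
    # atanh(c) = c + c^3/3 + c^5/5 + ..., so dropping the k = 1 term
    # and halving gives the sum over odd k >= 3 of c^k / (2k).
    return 0.5 * (atanh(c) - c)
```

By Markov's inequality, either value bounds the probability that G(N, r) is not bipartite, and the bound stays small as long as Nr is well below 1.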
Our index
Assumptions
- probabilities p and q are given by $p : q = |T| : |F|$
- conflict graph is a random graph
The index combines
- the probability of an edge to appear in the conflict graph
- the threshold for a random graph to be bipartite or not
($|S_0| + |S_1| = n$)
Our index
If $|T| + |F|$ is smaller than the index, (T, F) tends to have many deceptive decomposable structures.
If $|T| + |F|$ is larger than the index, (T, F) tends to have no deceptive decomposable structure.
Numerical Experiments
Randomly generated data
- Target functions are not decomposable
- Dimensions of data are n = 10, 20
- Two types of data: biased and not biased
1. Prepare non-decomposable randomly generated functions and construct 10 (T, F)s for each data size
2. Check their decomposability
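The two experimental steps can be sketched as follows. This is an illustration under several assumptions and not the authors' setup: the target is a uniformly random truth table (very likely, but not guaranteed, to be non-decomposable), data vectors are sampled without replacement, decomposability is tested for one fixed partition (S_0, S_1) through the conflict graph, and all function names are hypothetical.

```python
import random
from collections import deque
from itertools import product


def project(x, idx):
    """Restrict vector x to the 1-based attribute indices in idx."""
    return tuple(x[i - 1] for i in idx)


def is_decomposable(T, F, s0, s1):
    """Check that the (assumed) conflict graph of (T, F) is bipartite."""
    adj = {}
    for a in T:
        for b in F:
            if project(a, s0) == project(b, s0):
                u, v = project(a, s1), project(b, s1)
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def random_target(n, seed=0):
    """Random Boolean function on {0,1}^n, stored as a truth table."""
    rng = random.Random(seed)
    table = {x: rng.randint(0, 1) for x in product((0, 1), repeat=n)}
    return table.__getitem__


def decomposable_ratio(f, n, s0, s1, sample_size, trials=10, seed=0):
    """Percentage of sampled (T, F)s that are decomposable for (s0, s1)."""
    rng = random.Random(seed)
    universe = list(product((0, 1), repeat=n))
    hits = 0
    for _ in range(trials):
        sample = rng.sample(universe, sample_size)  # without replacement
        T = [x for x in sample if f(x) == 1]
        F = [x for x in sample if f(x) == 0]
        if is_decomposable(T, F, s0, s1):
            hits += 1
    return 100.0 * hits / trials
```

Evaluating `decomposable_ratio` over increasing values of `sample_size` gives a curve of the kind plotted against the sampling ratio in the experiments.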
Randomly generated data
[Figure: ratio of decomposable (T, F)s (%) versus sampling ratio (%), with our index marked]
Randomly generated data
[Figure: ratio of decomposable (T, F)s (%) versus sampling ratio (%), with our index marked]
Real-world data
Breast Cancer in Wisconsin (a.k.a. BCW)
- Already binarized
- The dimension is n = 11
- Comparison with randomly generated data with the same n, p and q
BCW and randomly generated data
[Figure, two panels (BCW; randomly generated data): ratio of decomposable (T, F)s (%) versus sampling ratio (%), with our index marked]
Discussion and conclusion
- An index to extract reliable decomposable structures
- Computational experiments on random & real-world data
  - the proposed index is a good estimate
  - when $|S_0| = 1$ or $|S_1| = 2$, the threshold behavior is not clear
Future work
- Analyses on sharpness of the threshold behavior: to know sufficient $|T| + |F|$ to extract reliable decomposable structures
- Apply similar approach to other classes of Boolean functions
[Figure: #decomposable structures versus $|T| + |F|$, marking the proposed index and the value we want to estimate]