Study of Bayesian network classifier Huang Kaizhu Huang Kaizhu Supervisors: Prof. Irwin King Supervisors: Prof. Irwin King Prof. Lyu Rung Tsong Michael Prof. Lyu Rung Tsong Michael Markers: Prof. Chan Lai Wan Prof. Wong Kin Hong Prof. Wong Kin Hong
OutlineOutline BackgroundBackground –What is Bayesian network? –How Bayesian networks can be used as classifiers? –Why choose Bayesian network? –What is problem of Learning Bayesian network ? My main worksMy main works –Large-Node Chow-Liu-tree –Maximum likelihood Large-Node-Bounded semi-Naïve BN Future workFuture work ConclusionConclusion BackgroundBackground –What is Bayesian network? –How Bayesian networks can be used as classifiers? –Why choose Bayesian network? –What is problem of Learning Bayesian network ? My main worksMy main works –Large-Node Chow-Liu-tree –Maximum likelihood Large-Node-Bounded semi-Naïve BN Future workFuture work ConclusionConclusion
BackgroundBackground What is Bayesian Network(BN)?What is Bayesian Network(BN)? –Composed of a “structure” component G and a “parameter” component . –G=(V,E) is a directed acyclic graph. nodes set :V and its edge set is E. And the nodes represent the attributes, the edges between the nodes represent the dependence relationship between the nodes. – is a conditional probability table. –It encodes the following joint probability among the nodes (X1,X2,…,Xn): What is Bayesian Network(BN)?What is Bayesian Network(BN)? –Composed of a “structure” component G and a “parameter” component . –G=(V,E) is a directed acyclic graph. nodes set :V and its edge set is E. And the nodes represent the attributes, the edges between the nodes represent the dependence relationship between the nodes. – is a conditional probability table. –It encodes the following joint probability among the nodes (X1,X2,…,Xn):
Background(con’t)Background(con’t) The Bayesian network above encodes the following probability relationship. P(F, B, L, D, H) =P(F) P(B)P(L |F) P(D |F,B)P(H|D) The Bayesian network above encodes the following probability relationship. P(F, B, L, D, H) =P(F) P(B)P(L |F) P(D |F,B)P(H|D) Structure component Parameter component Example of Bayesian network:
Background (con’t) How can BN be a classifier?How can BN be a classifier? –Firstly use BN to model the dataset –Then use the distribution encoded in BN to do classification How can BN be a classifier?How can BN be a classifier? –Firstly use BN to model the dataset –Then use the distribution encoded in BN to do classification
Background (con’t) Why choose Bayesian network?Why choose Bayesian network? –Bayesian network represents some inner relationship between the attributes –The joint probability based on BN can be written as a decomposable form Why choose Bayesian network?Why choose Bayesian network? –Bayesian network represents some inner relationship between the attributes –The joint probability based on BN can be written as a decomposable form
Background (con’t) What is the problem of learning Bayesian network?What is the problem of learning Bayesian network? –Given a training data set D={u 1, u 2, u 3 …u N } of instances of attributes U, find a network B that best matches D. What’s the difficulty in learning BN?What’s the difficulty in learning BN? –Generally speaking, BN optimization problem is intractable. Two ApproachesTwo Approaches –Either we constrain the searching in a certain restricted class of networks (Naïve BN, Semi-Naïve BN, CL-tree etc) Q1: Are these restricted class enough to represent the data? Q1: Are these restricted class enough to represent the data? –Or we adopt some heuristic methods on general networks (K2 etc) Q2: Are the heuristic methods on general network efficient ?Q2: Are the heuristic methods on general network efficient ? Q3: Are the heuristic methods on general network redundant to represent the data? Q3: Are the heuristic methods on general network redundant to represent the data? What is the problem of learning Bayesian network?What is the problem of learning Bayesian network? –Given a training data set D={u 1, u 2, u 3 …u N } of instances of attributes U, find a network B that best matches D. What’s the difficulty in learning BN?What’s the difficulty in learning BN? –Generally speaking, BN optimization problem is intractable. Two ApproachesTwo Approaches –Either we constrain the searching in a certain restricted class of networks (Naïve BN, Semi-Naïve BN, CL-tree etc) Q1: Are these restricted class enough to represent the data? Q1: Are these restricted class enough to represent the data? –Or we adopt some heuristic methods on general networks (K2 etc) Q2: Are the heuristic methods on general network efficient ?Q2: Are the heuristic methods on general network efficient ? Q3: Are the heuristic methods on general network redundant to represent the data? Q3: Are the heuristic methods on general network redundant to represent the data?
Background (con’t) Problems in two approachesProblems in two approaches –Q1: Are these restricted class enough to represent the data? No, in many cases, they are really too limited in expression ability to model the data.No, in many cases, they are really too limited in expression ability to model the data. –Q2: Are the heuristic methods on general network efficient ? No, they have a big search space, which will be greatly time- consumingNo, they have a big search space, which will be greatly time- consuming –Q3: is it possible that general networks obtained by the heuristic methods are redundant to represent the data? Yes, sometimes, these methods favors more complex structure, which will really increase the risk of over-fitting problem.Yes, sometimes, these methods favors more complex structure, which will really increase the risk of over-fitting problem. Problems in two approachesProblems in two approaches –Q1: Are these restricted class enough to represent the data? No, in many cases, they are really too limited in expression ability to model the data.No, in many cases, they are really too limited in expression ability to model the data. –Q2: Are the heuristic methods on general network efficient ? No, they have a big search space, which will be greatly time- consumingNo, they have a big search space, which will be greatly time- consuming –Q3: is it possible that general networks obtained by the heuristic methods are redundant to represent the data? Yes, sometimes, these methods favors more complex structure, which will really increase the risk of over-fitting problem.Yes, sometimes, these methods favors more complex structure, which will really increase the risk of over-fitting problem.
Possible solutions –Upgrading Solution How about we obtain a restricted BN firstly, then we aim at solving the shortcomings of this network caused by the restriction and upgrade it into a not so simple structure?How about we obtain a restricted BN firstly, then we aim at solving the shortcomings of this network caused by the restriction and upgrade it into a not so simple structure? –Bound Solution Can we take some strategies to bound the complexity of networks, then we find the best structure in this bound. The final network can be controlled by a bound parameter.Can we take some strategies to bound the complexity of networks, then we find the best structure in this bound. The final network can be controlled by a bound parameter. –Upgrading Solution How about we obtain a restricted BN firstly, then we aim at solving the shortcomings of this network caused by the restriction and upgrade it into a not so simple structure?How about we obtain a restricted BN firstly, then we aim at solving the shortcomings of this network caused by the restriction and upgrade it into a not so simple structure? –Bound Solution Can we take some strategies to bound the complexity of networks, then we find the best structure in this bound. The final network can be controlled by a bound parameter.Can we take some strategies to bound the complexity of networks, then we find the best structure in this bound. The final network can be controlled by a bound parameter.
Work1:Large node Chow-Liu tree Upgrade Chow-Liu tree(CLT) into Large node Chow- Liu tree(LNCLT)Upgrade Chow-Liu tree(CLT) into Large node Chow- Liu tree(LNCLT) –What is the restriction of CLT? CLT restricts the network in a tree structure among the variablesCLT restricts the network in a tree structure among the variables –Shortcomings caused by the restriction CLT can not represent many dataset with a non-tree underlying structure.CLT can not represent many dataset with a non-tree underlying structure. –Observations: A “large node tree” may partiallyA “large node tree” may partially solve this shortcoming. solve this shortcoming. –Example: Right figureRight figure Upgrade Chow-Liu tree(CLT) into Large node Chow- Liu tree(LNCLT)Upgrade Chow-Liu tree(CLT) into Large node Chow- Liu tree(LNCLT) –What is the restriction of CLT? CLT restricts the network in a tree structure among the variablesCLT restricts the network in a tree structure among the variables –Shortcomings caused by the restriction CLT can not represent many dataset with a non-tree underlying structure.CLT can not represent many dataset with a non-tree underlying structure. –Observations: A “large node tree” may partiallyA “large node tree” may partially solve this shortcoming. solve this shortcoming. –Example: Right figureRight figure
Work1:Large node Chow-Liu tree Large-node, which is a combination of several nodes, may partially relax the tree restriction.Large-node, which is a combination of several nodes, may partially relax the tree restriction. In forming a large node,There are two requirements. In forming a large node,There are two requirements. –Requirement 1 Large-node must be really like a single node which means the nodes in a Large node are really more dependent on each other.Large-node must be really like a single node which means the nodes in a Large node are really more dependent on each other. –Requirement 2 Large-node can not be too “large” or the probability estimation of this large node will not be not reliableLarge-node can not be too “large” or the probability estimation of this large node will not be not reliable –An extreme situation is that we combine all the nodes into a large node. This situation will lost all the advantages of Bayesian network. Large-node, which is a combination of several nodes, may partially relax the tree restriction.Large-node, which is a combination of several nodes, may partially relax the tree restriction. In forming a large node,There are two requirements. In forming a large node,There are two requirements. –Requirement 1 Large-node must be really like a single node which means the nodes in a Large node are really more dependent on each other.Large-node must be really like a single node which means the nodes in a Large node are really more dependent on each other. –Requirement 2 Large-node can not be too “large” or the probability estimation of this large node will not be not reliableLarge-node can not be too “large” or the probability estimation of this large node will not be not reliable –An extreme situation is that we combine all the nodes into a large node. This situation will lost all the advantages of Bayesian network.
Upgrade CLT into Large-Node-CLT A bounded Frequent itemset can satisfy the Requirement 1 & 2A bounded Frequent itemset can satisfy the Requirement 1 & 2 –What is Frequent itemset? It is the set of attributes that come together with each other frequently.It is the set of attributes that come together with each other frequently. Example: Food store---{bread}, {button}, {bread, button}Example: Food store---{bread}, {button}, {bread, button} –An frequent itemset with high frequency is more like a “large node”. ---Requirement 1 ---Requirement 1 –We restrict that the the number of nodes involved in a large node is no greater than a K threshold ---Requirement 2 ---Requirement 2 –Frequent itemset can be obtained according to the algorithm Apriori in [AS1994] A bounded Frequent itemset can satisfy the Requirement 1 & 2A bounded Frequent itemset can satisfy the Requirement 1 & 2 –What is Frequent itemset? It is the set of attributes that come together with each other frequently.It is the set of attributes that come together with each other frequently. Example: Food store---{bread}, {button}, {bread, button}Example: Food store---{bread}, {button}, {bread, button} –An frequent itemset with high frequency is more like a “large node”. ---Requirement 1 ---Requirement 1 –We restrict that the the number of nodes involved in a large node is no greater than a K threshold ---Requirement 2 ---Requirement 2 –Frequent itemset can be obtained according to the algorithm Apriori in [AS1994]
Upgrade CLT into Large-Node-CLT The construction algorithmThe construction algorithm 1. Call Apriori[AS94] to generate the frequent itemsets, which have the size less than k. Record all the frequent itemsets together with their supports into list L. 2.Draft the CL-tree of the dataset according to the CLT algorithm 3.Until L is null 4.Iteratively combine the frequent itemsets which satisfy the combination conditions: father-son or sibling relationship The construction algorithmThe construction algorithm 1. Call Apriori[AS94] to generate the frequent itemsets, which have the size less than k. Record all the frequent itemsets together with their supports into list L. 2.Draft the CL-tree of the dataset according to the CLT algorithm 3.Until L is null 4.Iteratively combine the frequent itemsets which satisfy the combination conditions: father-son or sibling relationship Example: We assume the k is 2, after step 1, we get the frequent itemsets {A, B} {A, C},{B, C}, {B, E}, {B, D}, {D, E}. And f({B, C})>f({A, B})> f({B, E}) >f({B, D})>f({D, E}) (s represents the frequency of frequent itemsets). (b) is the CLT in step2. Example: 1.{A,C} does not satisfy the combination condition, filter out {A,C} 2.f{B,C} is the biggest and satisfies combination condition, combine them into (c) 3..Filter the frequent itemsets which have coverage with {B,C}, the {D,E} is left. 4..{D, E } is the frequent itemset and satisfies the combination condition, combine them into (d)
ExperimentExperiment DatabaseDatabase – The experiments are conducted on MNIST handwritten digit database. –MNIST consists of : a digit training dataseta digit training dataset a digit testing dataset.a digit testing dataset. Both are 28*28 gray-level digit datasets Both are 28*28 gray-level digit datasets DatabaseDatabase – The experiments are conducted on MNIST handwritten digit database. –MNIST consists of : a digit training dataseta digit training dataset a digit testing dataset.a digit testing dataset. Both are 28*28 gray-level digit datasets Both are 28*28 gray-level digit datasets
ExperimentExperiment Preprocessing of MNIST databasePreprocessing of MNIST database –Binarization :We use a global binarization method to binarize MNIST datasets. –Feature Extraction[Bakis68]: 4*4*6=96 dimension feature Preprocessing of MNIST databasePreprocessing of MNIST database –Binarization :We use a global binarization method to binarize MNIST datasets. –Feature Extraction[Bakis68]: 4*4*6=96 dimension feature
ExperimentExperiment We build 10 LNCLTs for 10 digits, we give out the classification result by selecting the LNCLT which has a maximum probability output.We build 10 LNCLTs for 10 digits, we give out the classification result by selecting the LNCLT which has a maximum probability output. We compare LNCLT with CLT in :We compare LNCLT with CLT in : –Data fitness---log likelihood –Recognition rate We build 10 LNCLTs for 10 digits, we give out the classification result by selecting the LNCLT which has a maximum probability output.We build 10 LNCLTs for 10 digits, we give out the classification result by selecting the LNCLT which has a maximum probability output. We compare LNCLT with CLT in :We compare LNCLT with CLT in : –Data fitness---log likelihood –Recognition rate
Experimental results Data fitness---Log likelihood testing
Experimental results Recognition rate We randomly selected 1000 digits as test datasets from the digit testing dataset. We do the testing 10 times Recognition rate We randomly selected 1000 digits as test datasets from the digit testing dataset. We do the testing 10 times
Work2: Bound approach in semi-Naïve Bayesian network 1.A bounded Semi-Naïve Bayesian network(SNB). 2. We reduced the SNB into a network which has the same number K of nodes in every subset, K is the bound parameter. 3. We use Linear programming to do the optimization. 4. Our solution is shown sub-optimal 1.A bounded Semi-Naïve Bayesian network(SNB). 2. We reduced the SNB into a network which has the same number K of nodes in every subset, K is the bound parameter. 3. We use Linear programming to do the optimization. 4. Our solution is shown sub-optimal
Comparison between our model and traditional SNB Time costTime cost –Our model can be solved in a polynomial time –Traditional SNB has an exponential time cost StructureStructure –Each large node in our model has the same number of nodes K, K is a bound parameter –The number of nodes in subsets of traditional SNB are not same and some values of this number may be very large. PerformancePerformance –Our model is shown to be a sub-optimal in the bound restriction –In traditional SNB, there is no evidence that show it is optimal or sub- optimal Time costTime cost –Our model can be solved in a polynomial time –Traditional SNB has an exponential time cost StructureStructure –Each large node in our model has the same number of nodes K, K is a bound parameter –The number of nodes in subsets of traditional SNB are not same and some values of this number may be very large. PerformancePerformance –Our model is shown to be a sub-optimal in the bound restriction –In traditional SNB, there is no evidence that show it is optimal or sub- optimal
Experimental results We evaluate our approach on Tic and vote dataset from UCI machine learning repositoryWe evaluate our approach on Tic and vote dataset from UCI machine learning repository
Future work Evaluate our approaches based on a large number of datasets in Machine Learning repository from UCIEvaluate our approaches based on a large number of datasets in Machine Learning repository from UCI Build a Bayesian network which combine the upgrading strategy and bound strategyBuild a Bayesian network which combine the upgrading strategy and bound strategy –In fact we are considering if we can upgrade our bounded- SNB into a mixture model of bounded-SNB. Evaluate our approaches based on a large number of datasets in Machine Learning repository from UCIEvaluate our approaches based on a large number of datasets in Machine Learning repository from UCI Build a Bayesian network which combine the upgrading strategy and bound strategyBuild a Bayesian network which combine the upgrading strategy and bound strategy –In fact we are considering if we can upgrade our bounded- SNB into a mixture model of bounded-SNB.
ConclusionConclusion A dilemma between simple structure and complex structure seems to exist in learning Bayesian network classifiers.A dilemma between simple structure and complex structure seems to exist in learning Bayesian network classifiers. In this presentation, we test two approaches to deal with this problem. One is the Large-node Chow-Liu tree approach which is based on upgrading idea and the other is bounded semi-Naïve Bayesian network.In this presentation, we test two approaches to deal with this problem. One is the Large-node Chow-Liu tree approach which is based on upgrading idea and the other is bounded semi-Naïve Bayesian network. The experimental results show that these two approaches are promising and encouraging.The experimental results show that these two approaches are promising and encouraging. A dilemma between simple structure and complex structure seems to exist in learning Bayesian network classifiers.A dilemma between simple structure and complex structure seems to exist in learning Bayesian network classifiers. In this presentation, we test two approaches to deal with this problem. One is the Large-node Chow-Liu tree approach which is based on upgrading idea and the other is bounded semi-Naïve Bayesian network.In this presentation, we test two approaches to deal with this problem. One is the Large-node Chow-Liu tree approach which is based on upgrading idea and the other is bounded semi-Naïve Bayesian network. The experimental results show that these two approaches are promising and encouraging.The experimental results show that these two approaches are promising and encouraging.
Main Reference [AS1994] R. Agrawal, R. Srikant, 1994,“Fast algorithms for mining association rules”, Proc. VLDB [Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Information Theory, 14,(pp ) [Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29,(pp ) [Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of sixth European working session on learning (pp ). Springer-Verlag [Maxwell1995] Learning Bayesian Networks is NP-Complete [Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: networks of plausible inference, Morgan Kaufmann. [Cheng1997] Cheng, J. Bell, D.A. Liu, W. 1997, Learning Belief Networks from Data: An Information Theory Based Approach. In Proceedings of ACM CIKM’97 [Cheng2001] Cheng, J. and Greiner, R. 2001, Learning Bayesian Belief Network Classifiers: Algorithms and System, E.Stroulia and S. Matwin(Eds.): AI 2001, LNAI 2056, (pp ), [Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. Extending Naive Bayes Classifiers using long Itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, (pp. 165—174) [Srebro2000] Artificial Intelligence Laboratory, Massachusetts Institute of Technology Cambridge, Massachusetts 02139, [AS1994] R. Agrawal, R. Srikant, 1994,“Fast algorithms for mining association rules”, Proc. VLDB [Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Information Theory, 14,(pp ) [Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29,(pp ) [Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of sixth European working session on learning (pp ). Springer-Verlag [Maxwell1995] Learning Bayesian Networks is NP-Complete [Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: networks of plausible inference, Morgan Kaufmann. [Cheng1997] Cheng, J. Bell, D.A. Liu, W. 1997, Learning Belief Networks from Data: An Information Theory Based Approach. In Proceedings of ACM CIKM’97 [Cheng2001] Cheng, J. and Greiner, R. 2001, Learning Bayesian Belief Network Classifiers: Algorithms and System, E.Stroulia and S. Matwin(Eds.): AI 2001, LNAI 2056, (pp ), [Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. Extending Naive Bayes Classifiers using long Itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, (pp. 165—174) [Srebro2000] Artificial Intelligence Laboratory, Massachusetts Institute of Technology Cambridge, Massachusetts 02139,
Q& A Thanks!Thanks!
Work2: Bound strategy on Semi-Naïve BN We restrict the semi-naïve network into not too complex structure.We restrict the semi-naïve network into not too complex structure. Bounded-SNB MODEL DEFINITION Bounded-SNB MODEL DEFINITION Large Node Bounded semi-Naïve BN Model
Reduce Bounded-SNB MODEL According to Lemma 1, given a bound K, we should not separate the variables set into too many small subsets. Or it is more possible that we can combine some of the subsets into a new subset whose cardinality is no greater than K, thus the new SNB will be coarser than the old one. From this viewpoint, we reduce the searching space of BLN-SNB into a K-regular SNB space since there are no possibility that a SNB coarser than K-regular SNB exists in the K-bound. Even though it is reasonable to search the maximum likelihood SNB in the K-regular-SNB space, we won't say that: a K-regular SNB is absolutely better than a non-K-regular SNB with the biggest cardinality no more than K. It is obvious some non-K-regular SNBs can not be combined into a K-regular SNB. Thus in such a way, we reduce the searching space into a sub-space of K-bound SNB.
Difference between our model & traditional SNB 1.Different approach Traditional SNB employs independence testing to find the semi structure,which will cause an exponential computational cost.Traditional SNB employs independence testing to find the semi structure,which will cause an exponential computational cost. Our approach employs the linear programming method to find the semi structure, which is polynomial in computational complexity.Our approach employs the linear programming method to find the semi structure, which is polynomial in computational complexity. 2.Different performance There are no evidence that shows traditional SNB can find an optimal or sub-optimal structure.There are no evidence that shows traditional SNB can find an optimal or sub-optimal structure. Our approach can maintain a sub-optimal structure.Our approach can maintain a sub-optimal structure. 1.Different approach Traditional SNB employs independence testing to find the semi structure,which will cause an exponential computational cost.Traditional SNB employs independence testing to find the semi structure,which will cause an exponential computational cost. Our approach employs the linear programming method to find the semi structure, which is polynomial in computational complexity.Our approach employs the linear programming method to find the semi structure, which is polynomial in computational complexity. 2.Different performance There are no evidence that shows traditional SNB can find an optimal or sub-optimal structure.There are no evidence that shows traditional SNB can find an optimal or sub-optimal structure. Our approach can maintain a sub-optimal structure.Our approach can maintain a sub-optimal structure.
K-Bounded-SNB Problem K-Bounded-SNB Problem: Finding the m= [n/K ] K-cardinality subsets from attributes set which satisfy the SNB conditions to maximize the log likelihood (3). [x] means rounding the x to the nearest integer K-Bounded-SNB Problem: Finding the m= [n/K ] K-cardinality subsets from attributes set which satisfy the SNB conditions to maximize the log likelihood (3). [x] means rounding the x to the nearest integer
Transforming into Integer Programming Problem Model definition If we relax the (6) into 0 x 1, IP is transformed into a Linear Programming problem which can be solved in a polynomial time. If we relax the (6) into 0 x 1, IP is transformed into a Linear Programming problem which can be solved in a polynomial time.
Computational complexity analysis Traditional SNB time cost is exponential cost Our model is polynomial time cost