Study of Bayesian Network Classifiers
Huang Kaizhu
Supervisors: Prof. Irwin King, Prof. Lyu Rung Tsong Michael
Markers: Prof. Chan Lai Wan, Prof. Wong Kin Hong

Outline
Background
– What is a Bayesian network?
– How can Bayesian networks be used as classifiers?
– Why choose Bayesian networks?
– What is the problem of learning a Bayesian network?
My main work
– Large-node Chow-Liu tree
– Maximum-likelihood bounded semi-naïve Bayesian network
Future work
Conclusion

Background
What is a Bayesian network (BN)?
– A BN is composed of a "structure" component G and a "parameter" component Θ.
– G = (V, E) is a directed acyclic graph with node set V and edge set E. The nodes represent the attributes; the edges represent dependence relationships between the attributes.
– Θ is a set of conditional probability tables.
– Together, G and Θ encode the following joint probability over the nodes (X1, X2, …, Xn):
  P(X1, X2, …, Xn) = Π_i P(Xi | Pa(Xi)), where Pa(Xi) denotes the parents of Xi in G.

Background (con't)
Example of a Bayesian network (figure on the original slide, showing the structure component and the parameter component):
This Bayesian network encodes the following probability relationship:
P(F, B, L, D, H) = P(F) P(B) P(L | F) P(D | F, B) P(H | D)
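To make the factorization concrete, here is a minimal Python sketch of the example network above. The conditional probability values are hypothetical (the slide gives no numbers); only the structure P(F)P(B)P(L|F)P(D|F,B)P(H|D) is taken from the slide.

    # Minimal sketch: binary variables with made-up conditional probability tables.
    # Each table entry is P(variable = 1 | parent configuration).
    P_F = 0.3                                   # P(F=1)
    P_B = 0.6                                   # P(B=1)
    P_L_given_F = {0: 0.1, 1: 0.7}              # P(L=1 | F)
    P_D_given_FB = {(0, 0): 0.05, (0, 1): 0.2,
                    (1, 0): 0.6,  (1, 1): 0.9}  # P(D=1 | F, B)
    P_H_given_D = {0: 0.1, 1: 0.8}              # P(H=1 | D)

    def bernoulli(p, value):
        """Return P(X = value) for a binary variable with P(X=1) = p."""
        return p if value == 1 else 1.0 - p

    def joint(f, b, l, d, h):
        """Joint probability P(F=f, B=b, L=l, D=d, H=h) from the factorization."""
        return (bernoulli(P_F, f)
                * bernoulli(P_B, b)
                * bernoulli(P_L_given_F[f], l)
                * bernoulli(P_D_given_FB[(f, b)], d)
                * bernoulli(P_H_given_D[d], h))

    # Sanity check: the joint distribution sums to 1 over all 2**5 configurations.
    total = sum(joint(f, b, l, d, h)
                for f in (0, 1) for b in (0, 1) for l in (0, 1)
                for d in (0, 1) for h in (0, 1))
    print(total)  # ~1.0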

Background (con't)
How can a BN be a classifier?
– First, use a BN to model the dataset.
– Then, use the distribution encoded in the BN to do classification, for example with the Bayes decision rule below.
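Stated as the standard Bayes decision rule (this equation is our addition, not from the slides): the classifier picks the class whose BN assigns the highest joint probability to the observed attributes, weighted by the class prior.

    c^{*} = \arg\max_{c} P(C = c \mid x_1, \ldots, x_n)
          = \arg\max_{c} P(C = c)\, P(x_1, \ldots, x_n \mid C = c),

where P(x_1, …, x_n | C = c) is the joint distribution encoded by the Bayesian network learned for class c.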

Background (con't)
Why choose Bayesian networks?
– A Bayesian network represents inner relationships among the attributes.
– The joint probability based on a BN can be written in a decomposable form.

Background (con't)
What is the problem of learning a Bayesian network?
– Given a training dataset D = {u1, u2, u3, …, uN} of instances of the attribute set U, find a network B that best matches D.
What is the difficulty in learning a BN?
– Generally speaking, the BN optimization problem is intractable.
Two approaches
– Either we constrain the search to a certain restricted class of networks (naïve BN, semi-naïve BN, CL tree, etc.)
  Q1: Are these restricted classes expressive enough to represent the data?
– Or we adopt heuristic methods on general networks (K2, etc.)
  Q2: Are the heuristic methods on general networks efficient?
  Q3: Are the general networks obtained by heuristic methods redundant for representing the data?

Background (con't)
Problems with the two approaches
– Q1: Are these restricted classes expressive enough to represent the data?
  No. In many cases they are too limited in expressive ability to model the data.
– Q2: Are the heuristic methods on general networks efficient?
  No. They have a large search space, which is very time-consuming.
– Q3: Is it possible that the general networks obtained by heuristic methods are redundant for representing the data?
  Yes. These methods sometimes favor more complex structures, which increases the risk of over-fitting.

Possible solutions
– Upgrading solution
  First obtain a restricted BN, then address the shortcomings caused by the restriction and upgrade it into a less restricted structure.
– Bound solution
  Adopt a strategy that bounds the complexity of the networks, then find the best structure within this bound. The final network is controlled by a bound parameter.

Work 1: Large-node Chow-Liu tree
Upgrade the Chow-Liu tree (CLT) into the large-node Chow-Liu tree (LNCLT)
– What is the restriction of the CLT?
  The CLT restricts the network to a tree structure over the variables.
– Shortcoming caused by the restriction
  The CLT cannot represent many datasets whose underlying structure is not a tree.
– Observation: a "large-node tree" may partially overcome this shortcoming.
– Example: figure on the original slide.

Work 1: Large-node Chow-Liu tree
A large node, which is a combination of several nodes, may partially relax the tree restriction.
In forming a large node, there are two requirements:
– Requirement 1
  A large node must really behave like a single node, which means the nodes inside a large node should be strongly dependent on each other.
– Requirement 2
  A large node cannot be too "large", or the probability estimation for this large node will not be reliable.
  An extreme case is combining all the nodes into one large node; this would lose all the advantages of a Bayesian network.

Upgrade the CLT into the large-node CLT
Bounded frequent itemsets can satisfy Requirements 1 and 2.
– What is a frequent itemset?
  It is a set of attributes that frequently occur together.
  Example: food store — {bread}, {butter}, {bread, butter}.
– A frequent itemset with high frequency behaves more like a "large node".  --- Requirement 1
– We restrict the number of nodes involved in a large node to be no greater than a threshold K.  --- Requirement 2
– Frequent itemsets can be obtained with the Apriori algorithm [AS1994]; a minimal sketch follows this slide.
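A minimal sketch of bounded frequent-itemset mining in the Apriori style. The dataset, minimum support, and bound K below are hypothetical and only illustrate the candidate-generation / support-counting loop, not the exact implementation used in the thesis (a full Apriori would also prune candidates with infrequent subsets).

    def bounded_frequent_itemsets(transactions, min_support, k_bound):
        """Return {itemset: support} for all itemsets of size <= k_bound whose
        support (fraction of transactions containing the itemset) >= min_support."""
        n = len(transactions)
        items = sorted({item for t in transactions for item in t})
        frequent = {}
        # Level 1: frequent single items.
        current = []
        for item in items:
            support = sum(1 for t in transactions if item in t) / n
            if support >= min_support:
                frequent[frozenset([item])] = support
                current.append(frozenset([item]))
        size = 1
        # Higher levels: extend frequent (size)-itemsets to (size+1)-itemsets.
        while current and size < k_bound:
            candidates = {a | b for a in current for b in current
                          if len(a | b) == size + 1}
            current = []
            for cand in candidates:
                support = sum(1 for t in transactions if cand <= t) / n
                if support >= min_support:
                    frequent[cand] = support
                    current.append(cand)
            size += 1
        return frequent

    # Hypothetical transactions over attributes A-E, bound K = 2.
    data = [{"A", "B", "C"}, {"B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}]
    print(bounded_frequent_itemsets(data, min_support=0.5, k_bound=2))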

Upgrade the CLT into the large-node CLT
The construction algorithm
1. Call Apriori [AS1994] to generate the frequent itemsets of size no greater than K. Record all the frequent itemsets together with their supports in a list L.
2. Draft the CL tree of the dataset according to the CLT algorithm (a sketch of this step follows the slide).
3. Until L is empty:
4.   Iteratively combine the frequent itemsets that satisfy the combination condition: a father-son or sibling relationship in the current tree.
Example: Assume K is 2. After step 1 we obtain the frequent itemsets {A, B}, {A, C}, {B, C}, {B, E}, {B, D}, {D, E}, with f({B, C}) > f({A, B}) > f({B, E}) > f({B, D}) > f({D, E}) (f(·) denotes the frequency of a frequent itemset). Figure (b) on the original slide is the CLT from step 2.
1. {A, C} does not satisfy the combination condition; filter out {A, C}.
2. f({B, C}) is the largest and {B, C} satisfies the combination condition; combine B and C as in figure (c).
3. Filter out the frequent itemsets that overlap with {B, C}; only {D, E} is left.
4. {D, E} is a frequent itemset and satisfies the combination condition; combine D and E as in figure (d).
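For reference, step 2 is the classical Chow-Liu procedure [Chow, Liu1968]: estimate the pairwise mutual information between attributes from the data and take a maximum-weight spanning tree over these weights. The sketch below is a minimal, generic version of that step (discrete data, Prim-style tree growing); the toy dataset and names are illustrative assumptions, not the thesis code.

    from collections import Counter
    from math import log

    def mutual_information(data, i, j):
        """Empirical mutual information I(X_i; X_j) from rows of discrete data."""
        n = len(data)
        pij = Counter((row[i], row[j]) for row in data)
        pi = Counter(row[i] for row in data)
        pj = Counter(row[j] for row in data)
        mi = 0.0
        for (a, b), c in pij.items():
            p_ab = c / n
            # p_ab * log( p_ab / (p_a * p_b) ), with counts rewritten to avoid extra divisions
            mi += p_ab * log(c * n / (pi[a] * pj[b]))
        return mi

    def chow_liu_edges(data, num_attrs):
        """Edges of a maximum-weight spanning tree where edge (i, j) is weighted
        by I(X_i; X_j) -- the Chow-Liu tree skeleton."""
        weights = {(i, j): mutual_information(data, i, j)
                   for i in range(num_attrs) for j in range(i + 1, num_attrs)}
        in_tree = {0}                     # grow the tree from attribute 0 (Prim-style)
        edges = []
        while len(in_tree) < num_attrs:
            best = max(((i, j) for (i, j) in weights
                        if (i in in_tree) != (j in in_tree)),
                       key=lambda e: weights[e])
            edges.append(best)
            in_tree.update(best)
        return edges

    # Toy 4-attribute dataset (each row is one instance).
    rows = [(0, 0, 1, 0), (0, 0, 1, 1), (1, 1, 0, 1), (1, 1, 0, 0), (1, 0, 0, 1)]
    print(chow_liu_edges(rows, num_attrs=4))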

Experiment
Database
– The experiments are conducted on the MNIST handwritten digit database.
– MNIST consists of:
  a training dataset of 60,000 digits
  a testing dataset of 10,000 digits
  Both are 28*28 gray-level digit images.

Experiment
Preprocessing of the MNIST database
– Binarization: we use a global binarization method to binarize the MNIST datasets (a toy sketch follows this slide).
– Feature extraction [Bakis68]: 4*4*6 = 96-dimensional features.
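A toy sketch of the global binarization step, assuming a single fixed threshold applied to every pixel; the actual threshold used in the thesis is not given in the transcript, so the value below is an illustrative assumption.

    def binarize(image, threshold=128):
        """Globally binarize a gray-level image (list of rows of 0-255 pixels):
        a pixel becomes 1 if it is at or above the threshold, else 0."""
        return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]

    # A hypothetical 3x3 gray-level patch.
    patch = [[0, 130, 255],
             [12, 127, 128],
             [200, 64, 90]]
    print(binarize(patch))  # [[0, 1, 1], [0, 0, 1], [1, 0, 0]]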

Experiment
We build 10 LNCLTs, one per digit class, and give the classification result by selecting the LNCLT with the maximum probability output (see the sketch below).
We compare the LNCLT with the CLT on:
– Data fitness --- log likelihood
– Recognition rate
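A minimal sketch of this decision rule, assuming each fitted class model exposes a log_likelihood(x) method and class priors come from training counts; both the interface and the numbers below are hypothetical, not the thesis code.

    from math import log

    class StubModel:
        """Hypothetical stand-in for a fitted LNCLT: it only exposes the
        log-likelihood interface assumed by classify() below."""
        def __init__(self, loglik):
            self._loglik = loglik
        def log_likelihood(self, x):
            return self._loglik   # a real model would score x here

    def classify(x, class_models, class_priors):
        """Pick the digit whose model gives the highest score
        log P(digit) + log P(x | digit)."""
        return max(class_models,
                   key=lambda d: log(class_priors[d]) + class_models[d].log_likelihood(x))

    # Hypothetical example with 3 of the 10 digit classes.
    models = {0: StubModel(-50.0), 1: StubModel(-42.0), 2: StubModel(-60.0)}
    priors = {0: 0.3, 1: 0.3, 2: 0.4}
    print(classify(x=None, class_models=models, class_priors=priors))  # -> 1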

Experimental results
Data fitness --- log-likelihood testing

Experimental results
Recognition rate
We randomly selected 1000 digits from the digit testing dataset as test data and repeated the test 10 times.

Work 2: Bound approach for the semi-naïve Bayesian network
1. A bounded semi-naïve Bayesian network (SNB).
2. We reduce the SNB to a network in which every subset has the same number of nodes K, where K is the bound parameter.
3. We use linear programming to do the optimization.
4. Our solution is shown to be sub-optimal.

Comparison between our model and the traditional SNB
Time cost
– Our model can be solved in polynomial time.
– The traditional SNB has an exponential time cost.
Structure
– Each large node in our model has the same number of nodes K, where K is the bound parameter.
– The subsets of the traditional SNB do not all have the same size, and some of them may be very large.
Performance
– Our model is shown to be sub-optimal under the bound restriction.
– For the traditional SNB, there is no evidence showing that it is optimal or sub-optimal.

Experimental results
We evaluate our approach on the Tic and Vote datasets from the UCI machine learning repository.

Future work
Evaluate our approaches on a large number of datasets from the UCI machine learning repository.
Build a Bayesian network that combines the upgrading strategy and the bound strategy.
– In fact, we are considering whether we can upgrade our bounded SNB into a mixture model of bounded SNBs.

Conclusion
A dilemma between simple structures and complex structures seems to exist in learning Bayesian network classifiers.
In this presentation, we test two approaches to deal with this problem: one is the large-node Chow-Liu tree, which is based on the upgrading idea, and the other is the bounded semi-naïve Bayesian network.
The experimental results show that these two approaches are promising and encouraging.

Main References
[AS1994] Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. Proc. VLDB.
[Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Information Theory, 14.
[Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29.
[Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of the Sixth European Working Session on Learning. Springer-Verlag.
[Maxwell1995] Learning Bayesian Networks is NP-Complete.
[Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
[Cheng1997] Cheng, J., Bell, D.A. and Liu, W. (1997). Learning Belief Networks from Data: An Information Theory Based Approach. In Proceedings of ACM CIKM'97.
[Cheng2001] Cheng, J. and Greiner, R. (2001). Learning Bayesian Belief Network Classifiers: Algorithms and System. In E. Stroulia and S. Matwin (Eds.), AI 2001, LNAI 2056.
[Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. (1999). Extending Naive Bayes Classifiers Using Long Itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, pp. 165-174.
[Srebro2000] Srebro, N. (2000). Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.

Q & A
Thanks!

Work 2: Bound strategy on the semi-naïve BN
We restrict the semi-naïve network to a structure that is not too complex.
Bounded-SNB model definition: the large-node bounded semi-naïve BN model (the formal definition is on the original slide; the general form is sketched below).
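The formal definition did not survive the transcript, but based on the semi-naïve Bayesian classifier of [Kononenko1991] and the bound described above, the model presumably takes the following general form (our reconstruction, not the slide's exact notation): the attributes X1, …, Xn are partitioned into disjoint subsets (large nodes) S1, …, Sm, each of cardinality at most K, and the subsets are assumed conditionally independent given the class C:

    P(X_1, \ldots, X_n \mid C = c) \;=\; \prod_{j=1}^{m} P\!\left(\mathbf{X}_{S_j} \mid C = c\right),
    \qquad |S_j| \le K, \quad S_i \cap S_j = \emptyset \ (i \ne j), \quad \bigcup_{j} S_j = \{X_1, \ldots, X_n\}.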

Reducing the Bounded-SNB model
According to Lemma 1, given a bound K, we should not partition the variable set into too many small subsets; otherwise it is likely that some of the subsets can be combined into a new subset whose cardinality is still no greater than K, and the resulting SNB would be coarser than the old one. From this viewpoint, we reduce the search space of the bounded SNB to the space of K-regular SNBs (every subset has exactly K variables), since no SNB coarser than a K-regular SNB exists within the K-bound.
Even though it is reasonable to search for the maximum-likelihood SNB in the K-regular-SNB space, we do not claim that a K-regular SNB is always better than a non-K-regular SNB whose largest cardinality is no more than K: clearly, some non-K-regular SNBs cannot be combined into a K-regular SNB. In this way, we reduce the search space to a sub-space of the K-bounded SNBs.

Difference between our model and the traditional SNB
1. Different approach
– The traditional SNB employs independence testing to find the semi-structure, which causes an exponential computational cost.
– Our approach employs linear programming to find the semi-structure, which is polynomial in computational complexity.
2. Different performance
– There is no evidence showing that the traditional SNB can find an optimal or sub-optimal structure.
– Our approach can maintain a sub-optimal structure.

K-Bounded-SNB problem
K-Bounded-SNB problem: find the m = [n/K] K-cardinality subsets of the attribute set that satisfy the SNB conditions and maximize the log likelihood (3), where [x] means rounding x to the nearest integer.

Transforming into an integer programming problem
Model definition: see the integer programming formulation on the original slide.
If we relax constraint (6) into 0 ≤ x ≤ 1, the IP is transformed into a linear programming (LP) problem, which can be solved in polynomial time (an illustrative sketch follows).
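A minimal sketch of such an LP relaxation, assuming a set-partitioning-style formulation: one 0/1 variable per candidate K-subset, each attribute covered by exactly one chosen subset, and an objective equal to the sum of the chosen subsets' log-likelihood contributions. The slide's actual constraints (including (6)) are not in the transcript, so the formulation, weights, and data below are illustrative assumptions. Uses scipy.optimize.linprog.

    from itertools import combinations
    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical problem: n = 4 attributes, bound K = 2, and made-up
    # log-likelihood contributions w[S] for every candidate 2-subset S.
    n, K = 4, 2
    subsets = list(combinations(range(n), K))
    w = {(0, 1): -1.0, (0, 2): -2.5, (0, 3): -2.0,
         (1, 2): -1.8, (1, 3): -2.2, (2, 3): -0.9}

    # Variables: x_S in [0, 1] for each candidate subset (relaxation of x_S in {0, 1}).
    # Maximize sum_S w[S] * x_S  <=>  minimize sum_S -w[S] * x_S.
    c = np.array([-w[s] for s in subsets])

    # Equality constraints: every attribute is covered by exactly one chosen subset.
    A_eq = np.zeros((n, len(subsets)))
    for col, s in enumerate(subsets):
        for attr in s:
            A_eq[attr, col] = 1.0
    b_eq = np.ones(n)

    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, 1.0)] * len(subsets))
    for s, value in zip(subsets, result.x):
        if value > 1e-6:
            print(s, round(value, 3))
    # Fractional values would then be rounded back to a feasible partition.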

Computational complexity analysis
The traditional SNB has an exponential time cost, whereas our model has a polynomial time cost.