1 Bayesian Methods
2 Naïve Bayes
New data point to classify: $X = (x_1, x_2, \ldots, x_m)$
Strategy:
  – Calculate $P(C_i \mid X)$ for each class $C_i$.
  – Select the $C_i$ for which $P(C_i \mid X)$ is maximum.
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \;\propto\; P(X \mid C_i)\,P(C_i) \;\approx\; P(x_1 \mid C_i)\,P(x_2 \mid C_i) \cdots P(x_m \mid C_i)\,P(C_i)$$
The last step naïvely assumes that the $x_i$ are independent of each other given the class.
We write $P(X)$ for $P(X \mid C_i)$, etc., when unambiguous.
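The strategy above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it reuses the four toy records from the Preliminaries slide, treats each item as a present/absent attribute, and adds Laplace smoothing (`alpha`), which is an assumption not made on the slides. The function names are hypothetical.

```python
from collections import Counter

# Toy training data: each record is the set of items present, plus a class label.
TRAIN = [
    ({"a", "b", "c"}, "C1"),
    ({"b", "c"}, "C1"),
    ({"a", "d"}, "C2"),
    ({"a", "c"}, "C1"),
]
ITEMS = ["a", "b", "c", "d"]


def train_naive_bayes(train, items, alpha=1.0):
    """Estimate P(C_i) and P(item present | C_i); alpha is Laplace smoothing (an assumption)."""
    class_counts = Counter(label for _, label in train)
    priors = {c: n / len(train) for c, n in class_counts.items()}
    likelihood = {}
    for c, n in class_counts.items():
        likelihood[c] = {}
        for item in items:
            present = sum(1 for rec, label in train if label == c and item in rec)
            likelihood[c][item] = (present + alpha) / (n + 2 * alpha)
    return priors, likelihood


def posterior(record, priors, likelihood, items):
    """Return P(C_i | X) for every class, normalising by P(X)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for item in items:
            p = likelihood[c][item]
            # Independence assumption: multiply one factor per attribute.
            score *= p if item in record else 1.0 - p
        scores[c] = score
    evidence = sum(scores.values())  # P(X)
    return {c: s / evidence for c, s in scores.items()}


priors, likelihood = train_naive_bayes(TRAIN, ITEMS)
post = posterior({"a", "b", "d"}, priors, likelihood, ITEMS)
best_class = max(post, key=post.get)
print(post, best_class)
```

Dividing by the evidence is optional for classification, since $P(X)$ is the same for every class; it is kept here only so that the scores read as probabilities.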
3 Bayesian Belief Networks
Naïve Bayes assumes independence between attributes.
  – Not always correct!
If we don’t assume independence, the number of probabilities to estimate grows exponentially, because every attribute can depend on every other attribute.
Luckily, in real life most attributes don’t depend (directly) on other attributes.
A Bayesian network explicitly encodes the dependencies between attributes.
4 Bayesian Belief Network
[Figure: a directed acyclic graph (DAG) over the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea.]
[Table: conditional probability table for LungCancer, with columns (FH, S), (FH, !S), (!FH, S), (!FH, !S) and rows LC, !LC.]
$$P(X) = P(x_1 \mid Parents(x_1))\,P(x_2 \mid Parents(x_2)) \cdots P(x_m \mid Parents(x_m))$$
e.g. $P(PositiveXRay, Dyspnea)$
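To make the factorization concrete, here is a small sketch restricted to three of the nodes: FamilyHistory and Smoker as parents of LungCancer, as the table's column headings suggest. Every number below is hypothetical, because the entries of the slide's table did not survive; treating the two parents as root nodes is likewise an assumption.

```python
from itertools import product

# HYPOTHETICAL probabilities, for illustration only.
P_FH = 0.1  # P(FamilyHistory = True)
P_S = 0.3   # P(Smoker = True)
# P(LungCancer = True | FamilyHistory, Smoker), keyed by (fh, s).
P_LC_GIVEN = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}


def bernoulli(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def joint(fh, s, lc):
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s): one factor per node given its parents."""
    return (bernoulli(P_FH, fh)
            * bernoulli(P_S, s)
            * bernoulli(P_LC_GIVEN[(fh, s)], lc))


# The joint sums to 1 over all assignments, and marginals follow by summation.
total = sum(joint(fh, s, lc) for fh, s, lc in product([True, False], repeat=3))
p_lung_cancer = sum(joint(fh, s, True) for fh, s in product([True, False], repeat=2))
print(total, p_lung_cancer)
```

The network needs only 2 + 4 numbers here instead of the 7 required for an unrestricted joint over three binary variables; the saving grows quickly with more nodes.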
5 Maximum Entropy Approach
Think emails, keywords, spam / non-spam.
Given a new data point $X = \{x_1, x_2, \ldots, x_m\}$ to classify:
  – Calculate $P(C_i \mid X)$ for each class $C_i$.
  – Select the $C_i$ for which $P(C_i \mid X)$ is maximum.
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \;\propto\; P(X \mid C_i)\,P(C_i)$$
Naïve Bayes assumes that each $x_i$ is independent.
Instead, estimate $P(X \mid C_i)$ directly from the training data, as the support of $X$ among the records of class $C_i$.
Problem: there may be no instance of $X$ in the training data.
  – Training data is usually sparse.
Solution: estimate $P(X \mid C_i)$ from the features that are available in the training data: $P(Y_j \mid C_i)$ might be known for several $Y_j$.
6 Background: Shannon’s Entropy
An experiment has several possible outcomes.
In $N$ experiments, suppose each outcome occurs $M$ times.
  – This means there are $N/M$ possible outcomes.
  – To represent each outcome, we need $\log_2 (N/M)$ bits.
This generalizes even when the outcomes are not all equally frequent.
  – Reason: for an outcome $i$ that occurs $M$ times, there are $N/M$ equiprobable events, of which only one corresponds to $i$.
Since $p_i = M/N$, the information content of outcome $i$ is $-\log_2 p_i$.
So the expected information content is
$$H = -\sum_i p_i \log_2 p_i$$
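As a quick check of the formula, the following sketch computes $H$ in bits. The function name `entropy` is only illustrative, and the convention that terms with $p_i = 0$ contribute nothing is a standard assumption.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# N/M equally frequent outcomes need log2(N/M) bits each.
print(entropy([0.25] * 4))   # four equiprobable outcomes
print(entropy([0.5, 0.5]))   # a fair coin
print(entropy([0.9, 0.1]))   # a more ordered system needs fewer bits on average
```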
7 Maximum Entropy Principle
Entropy corresponds to the disorder in a system.
  – Intuition: a highly ordered system requires fewer bits to represent.
If we have no evidence for any particular order in a system, we should assume that no such order exists.
The order that we do know of can be represented in the form of constraints.
Hence, we should maximize the entropy of the system subject to the known constraints.
If the constraints are consistent, there is a unique solution that maximizes entropy.
8 Max Ent in Classification
Among the distributions $P(X \mid C_i)$ that satisfy the known constraints, choose the one that has maximum entropy.
Use the selected distribution to classify according to the Bayesian approach.
9 Association Rule Based Methods
10 CPAR, CMAR, etc.
Separate the training data by class.
Find the frequent itemsets in each class.
  – Class association rules: LHS = frequent itemset, RHS = class label.
To classify a record R, find all association rules of each class that apply to R.
Combine the evidence of the rules to decide which class R belongs to.
  – E.g. add the probabilities of the best k rules.
  – Mathematically incorrect, but works well in practice.
11 Max Ent + Frequent Itemsets ACME
12 ACME
The frequent itemsets of each class, with their probabilities, are used as constraints in a max-entropy model.
  – Evidence is combined using max-ent.
  – Mathematically robust.
In practice, frequent itemsets represent all the significant constraints of a class.
  – Best in theory and practice.
But slow.
13 Preliminaries
Record     Class
a, b, c    C_1
b, c       C_1
a, d       C_2
a, c       C_1
a, b, d    ?      (query)
Features: I = {a, b, c, d}
Classes: C = {C_1, C_2}
14 Frequent Itemsets
Record     Class
a, b, c    C_1
b, c       C_1
a, d       C_2
a, c       C_1
An itemset whose frequency is at least the minimum support is called a frequent itemset.
Frequent itemsets are mined using the Apriori algorithm.
Ex: if the minimum support is 2, then {b, c} is a frequent itemset (it occurs in 2 records).
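The level-wise idea behind Apriori can be sketched as follows on the four records of this slide. This is a compact teaching version, not an optimized miner; the function names are hypothetical, and "frequent" is taken to mean a support count of at least `min_support`.

```python
from itertools import combinations

RECORDS = [{"a", "b", "c"}, {"b", "c"}, {"a", "d"}, {"a", "c"}]


def support_count(itemset, records):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for r in records if itemset <= r)


def apriori(records, min_support):
    """Return {frequent itemset: support count}, built one itemset size at a time."""
    items = sorted(set().union(*records))
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), records) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, records)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support_count(c, records) >= min_support]
    return frequent


frequent = apriori(RECORDS, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: {a, b, c} is never counted, because its subset {a, b} is already known to be infrequent.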
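15 Split Data by Classes
[Figure: the training records are split by class and Apriori is run on each part.]
  – C_1: records {a, b, c}, {b, c}, {a, c} → Apriori → frequent itemsets of C_1
  – C_2: records {a, d} → Apriori → frequent itemsets of C_2
Labels in the figure: C_1, C_2, S_1, S_2.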
16 Build Constraints for a class
Records of C_1: {a, b, c}, {b, c}, {a, c}
Constraints of C_1:
s_j        p_j
b, c       0.67
a, b, c    0.33
b          0.67
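Each $p_j$ in the table is the fraction of the class's records that contain the itemset $s_j$. A short sketch that reproduces the three values, with a hypothetical helper name:

```python
C1_RECORDS = [{"a", "b", "c"}, {"b", "c"}, {"a", "c"}]
ITEMSETS = [("b", "c"), ("a", "b", "c"), ("b",)]


def constraint_probability(itemset, records):
    """Fraction of the class's records that contain every item of `itemset`."""
    return sum(1 for r in records if set(itemset) <= r) / len(records)


# Rounded to two decimals to match the table on the slide.
constraints = {s: round(constraint_probability(s, C1_RECORDS), 2) for s in ITEMSETS}
print(constraints)
```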
17 Build distribution of class C_1
Constraints:
s_i        p_i
b, c       0.67
a, b, c    0.33
b          0.67
[Table: one row per possible record X over the items a, b, c, d, with its probability P(X|C_1).]
  – Total possible records: 2^4 in number.
Maximum Entropy Principle: build a distribution P(X|C_1) that conforms to the constraints and has the highest entropy.
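18 Log-Linear Modeling
The distribution is written in log-linear (product) form, with one parameter $\mu_j$ per constraint $s_j$ and a normalizing constant $\mu_0$:
$$P(X) = \mu_0 \prod_{j:\; s_j \subseteq X} \mu_j$$
These $\mu$'s can be computed by an iterative fitting algorithm like the GIS algorithm.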
19 Generalized Iterative Scaling Algorithm
# N items, M constraints
# T_k: the k-th possible record; Y_j: the itemset of constraint C_j; d_j: its target probability
P(X_k) = 1 / 2^N          # for k = 1…2^N; uniform distribution
µ_j = 1                   # for j = 1…M
while not all constraints are satisfied:
    for each constraint C_j:
        S_j = Σ_{k: T_k satisfies Y_j} P(X_k)
        µ_j *= d_j / S_j
        P(X_k) = µ_0 ∏_{j satisfied by T_k} µ_j
        # µ_0 is chosen to ensure that Σ_k P(X_k) = 1
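A runnable sketch of this kind of iterative fitting is given below. It is not a literal transcription of the slide: instead of multiplying $\mu_j$ by $d_j / S_j$, it uses the odds-ratio update of iterative proportional fitting, so that each constraint is met exactly right after its own update. It also uses only two of the C_1 constraints, with exact fractions 2/3 and 1/3, so that a strictly positive solution exists. All function and variable names are hypothetical.

```python
from itertools import product

ITEMS = ["a", "b", "c", "d"]
# Constraint itemset -> target probability d_j (two of the C_1 constraints).
CONSTRAINTS = {frozenset("bc"): 2 / 3, frozenset("abc"): 1 / 3}


def fit_log_linear(items, constraints, tol=1e-9, max_sweeps=1000):
    """Fit P(X) = mu_0 * prod(mu_j for constraints contained in X) by sequential scaling."""
    # Enumerate all 2^N possible records as sets of present items.
    records = [frozenset(i for i, bit in zip(items, bits) if bit)
               for bits in product([0, 1], repeat=len(items))]
    mu = {s: 1.0 for s in constraints}  # all mu_j = 1 gives the uniform distribution

    def distribution():
        weights = []
        for rec in records:
            w = 1.0
            for s, m in mu.items():
                if s <= rec:
                    w *= m
            weights.append(w)
        mu0 = 1.0 / sum(weights)  # normaliser, so the probabilities sum to 1
        return [mu0 * w for w in weights]

    for _ in range(max_sweeps):
        worst = 0.0
        for s, d in constraints.items():
            probs = distribution()
            S = sum(p for rec, p in zip(records, probs) if s <= rec)
            worst = max(worst, abs(S - d))
            # Odds-ratio update (assumes 0 < d < 1 and 0 < S < 1).
            mu[s] *= (d * (1 - S)) / (S * (1 - d))
        if worst < tol:
            break
    return dict(zip(records, distribution())), mu


dist, mu = fit_log_linear(ITEMS, CONSTRAINTS)
for s, d in CONSTRAINTS.items():
    fitted = sum(p for rec, p in dist.items() if s <= rec)
    print(sorted(s), round(fitted, 4), round(d, 4))
```

Enumerating all $2^N$ records is only feasible for a handful of items, which is why the deck goes on to prune records and split the feature set.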
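20 Problem with the Log-Linear Model
s_i        p_i
b, c       0.67
b          0.67
A solution does not exist if P(X|C_j) = 0 for any X, because a product of finite, non-zero µ's can never be exactly zero.
Here P(b) = P(b, c), so the probability is 0 for all X that have b = 1 but c = 0.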
21 Fix to the Model
[Table: the possible records X over a, b, c, d with P(X|C_1); the zero-probability records are set to 0.]
Fix: define the model only on those X whose probability is non-zero.
  – Explicitly set the probabilities of the zero-probability records to zero, and learn the µ's without considering them.
Learning time decreases as |X| decreases.
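One way to find such records, shown here purely as an illustration (the slides do not say how the pruning is done), is to look for a constraint whose itemset is contained in another constraint's itemset with the same probability: every record that contains the smaller itemset but not the larger one must then have probability zero. The function name is hypothetical.

```python
import math
from itertools import product

ITEMS = ["a", "b", "c", "d"]
CONSTRAINTS = {frozenset("bc"): 0.67, frozenset("b"): 0.67}


def zero_probability_records(items, constraints):
    """Return (all records, records forced to zero by nested equal-probability constraints)."""
    records = [frozenset(i for i, bit in zip(items, bits) if bit)
               for bits in product([0, 1], repeat=len(items))]
    zero = set()
    for small, p_small in constraints.items():
        for big, p_big in constraints.items():
            # small is a proper subset of big, and P(small) == P(big):
            # no probability mass is left for "small but not big".
            if small < big and math.isclose(p_small, p_big):
                zero.update(r for r in records if small <= r and not big <= r)
    return records, zero


records, zero = zero_probability_records(ITEMS, CONSTRAINTS)
kept = [r for r in records if r not in zero]
print(len(records), len(zero), len(kept))
```

On this example a quarter of the 16 records are removed before fitting, namely those with b present and c absent.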
22 Effect of pruning
Dataset         # Cons   Pruned X
Austra(354)              %
Waveform(99)    241.3%
Cleve(246)               %
Diabetes(85)             %
German(54)               %
Heart(115)               %
Breast(189)              %
Lymph(29)                %
Pima(87)        558.6%
Datasets chosen from the UCI ML Repository.
23 Making the approach scalable (1)
Remove non-informative constraints.
  – A constraint is informative if it can distinguish between classes very well.
  – Use the standard information measure.
Ex: s_1 = {a, b, c}, with P(C_1|s_1) = 0.45 and P(C_2|s_1) = 0.55.
  – Remove {a, b, c} from the constraint set.
Ex: s_2 = {b, c}, with P(C_1|s_2) = 0.8 and P(C_2|s_2) = 0.2.
  – Include {b, c} in the constraint set.
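A simple way to apply an information measure here is to compute the entropy of the class distribution given the itemset and keep the constraint only when that entropy is low. The sketch below does this for the two examples; the cut-off of 0.9 bits is a hypothetical value chosen for illustration, not one given on the slides.

```python
import math

MAX_CLASS_ENTROPY = 0.9  # HYPOTHETICAL threshold, in bits


def class_entropy(class_probs):
    """Entropy (bits) of P(class | itemset); low entropy means the itemset separates classes well."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)


def is_informative(class_probs, threshold=MAX_CLASS_ENTROPY):
    return class_entropy(class_probs) < threshold


candidates = {
    ("a", "b", "c"): [0.45, 0.55],  # s_1: close to a coin flip between the classes
    ("b", "c"): [0.8, 0.2],         # s_2: clearly favours C_1
}
kept = [s for s, probs in candidates.items() if is_informative(probs)]
print(kept)
```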
24 Making the approach scalable (2)
Splitting: split the set of features I into groups that are independent of each other.
  – Two groups of features are independent of each other if no constraint overlaps both of them.
The global P(.) can be calculated by merging the individual P(.)'s of each group in a naïve-Bayes fashion.
Ex: I = {a, b, c, d}, and the constraints are {a}, {a, b} and {c, d}.
  – Split I into I_1 = {a, b} and I_2 = {c, d}.
  – Learn log-linear models P_1(.) for I_1 = {a, b} and P_2(.) for I_2 = {c, d}.
  – P(b, c) = P_1(b) * P_2(c)
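Finding the groups amounts to computing connected components: two features belong to the same group whenever some constraint mentions both. A minimal union-find sketch on the example above, with a hypothetical function name:

```python
def split_features(items, constraints):
    """Group items so that no constraint spans two different groups."""
    parent = {i: i for i in items}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for constraint in constraints:
        members = sorted(constraint)
        for other in members[1:]:
            # All items of one constraint end up in the same group.
            parent[find(other)] = find(members[0])

    groups = {}
    for i in items:
        groups.setdefault(find(i), set()).add(i)
    return sorted(sorted(g) for g in groups.values())


groups = split_features(["a", "b", "c", "d"], [{"a"}, {"a", "b"}, {"c", "d"}])
print(groups)
```

Each group can then be fitted separately over $2^{|I_g|}$ records instead of $2^{|I|}$, which is where the saving comes from.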