
1 Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

2 © Daniel S. Weld 2 Logistics Reading Ch 13 Ch 14 thru 14.3

3 © Daniel S. Weld 3 DT Learning as Search. Nodes: decision trees; Operators: tree refinement (sprouting the tree); Initial node: the smallest tree possible (a single leaf); Heuristic: information gain; Goal: the best tree possible (???); Type of search: hill climbing.

4 Bias Restriction Preference

5 © Daniel S. Weld 5 Decision Tree Representation. Good day for tennis? [Tree diagram: Outlook branches to Sunny, Overcast, Rain; Sunny tests Humidity (High: No, Normal: Yes); Overcast: Yes; Rain tests Wind (Strong: No, Weak: Yes).] Leaves = classification; arcs = choice of value for the parent attribute. A decision tree is equivalent to logic in disjunctive normal form: G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)

6 © Daniel S. Weld 6 Overfitting. [Plot: accuracy (y-axis, 0.6 to 0.9) vs. model complexity (e.g. number of nodes in the decision tree); one curve for accuracy on training data, one for accuracy on test data.]

7 © Daniel S. Weld 7 Machine Learning Outline Supervised Learning Review Ensembles of Classifiers Bagging Cross-validated committees Boosting Stacking Co-training

8 © Daniel S. Weld 8 Voting

9 © Daniel S. Weld 9 Ensembles of Classifiers. Assume errors are independent (suppose each classifier has 30% error) and take a majority vote over an ensemble of 21 classifiers. What is the probability that the majority is wrong? The number of classifiers in error follows a binomial distribution; with individual error 0.3, the area under the curve for ≥ 11 wrong is 0.026, an order of magnitude improvement! [Plot: binomial distribution over the number of classifiers in error.]
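A quick sanity check of this arithmetic (not from the slides): with 21 classifiers that each err independently with probability 0.3, the majority vote is wrong only when 11 or more err at once.

```python
from math import comb

n, p = 21, 0.3
# Tail of the binomial distribution: P(11 or more of the 21 classifiers are wrong)
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
print(f"P(majority of {n} classifiers wrong) = {p_majority_wrong:.3f}")  # ~0.026
```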

10 © Daniel S. Weld 10 Constructing Ensembles: cross-validated committees (holdout). Partition the examples into k disjoint equivalence classes, then create k training sets: each set is the union of all equivalence classes except one, so each set has (k-1)/k of the original training data. Now train a classifier on each set.
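A minimal sketch of cross-validated committees, assuming NumPy arrays, integer class labels, and a scikit-learn-style base learner (the decision tree here is just a placeholder):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cv_committee(X, y, k=5, seed=0):
    # Partition the examples into k disjoint folds; member i trains on all folds but fold i.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    members = []
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        members.append(DecisionTreeClassifier().fit(X[train_idx], y[train_idx]))
    return members

def committee_vote(members, X):
    # Majority vote over the members' predictions (assumes integer class labels).
    preds = np.stack([m.predict(X) for m in members]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```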

11 © Daniel S. Weld 11 Ensemble Construction II: bagging. Generate k sets of training examples: for each set, draw m examples randomly (with replacement) from the original set of m examples, so each training set covers about 63.2% of the original examples (plus duplicates). Now train a classifier on each set.
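A minimal bagging sketch under the same assumptions (NumPy arrays and a scikit-learn-style base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, k=21, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    members = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # m draws with replacement
        # On average, len(np.unique(idx)) / m is about 1 - 1/e = 0.632
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return members
```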

12 © Daniel S. Weld 12 Ensemble Creation III: boosting ("bagging with optimized choice of examples"). Maintain a probability distribution over the set of training examples and create k training sets iteratively. On iteration i: draw m examples randomly (as in bagging), but use the probability distribution to bias selection; train classifier number i on this training set; test the partial ensemble (of i classifiers) on all training examples; and modify the distribution to increase the probability of each erroneous example. This creates harder and harder learning problems...
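A rough sketch of the boosting loop described above. The specific reweighting rule (doubling the weight of misclassified examples) and the use of the newest classifier, rather than the partial ensemble, for the error test are simplifying assumptions, not the slide's prescription:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    weights = np.full(m, 1.0 / m)               # probability distribution over training examples
    members = []
    for _ in range(k):
        idx = rng.choice(m, size=m, p=weights)  # draw m examples, biased by the distribution
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        members.append(clf)
        wrong = clf.predict(X) != y             # evaluate on all training examples
        weights[wrong] *= 2.0                   # assumed rule: make misclassified examples more likely
        weights /= weights.sum()
    return members
```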

13 © Daniel S. Weld 13 Ensemble Creation IV: stacking. Train several base learners, then train a meta-learner that learns when the base learners are right or wrong; the meta-learner then arbitrates. Train using cross-validated committees: the meta-learner's inputs are the base-learner predictions, and its training examples are the 'test set' from cross validation.
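A stacking sketch in which out-of-fold predictions serve as the meta-learner's training set; the particular base and meta learners (scikit-learn's decision tree, naive Bayes, and logistic regression) are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def train_stack(X, y):
    bases = [DecisionTreeClassifier(), GaussianNB()]
    # Out-of-fold predictions play the role of the 'test set' from cross validation.
    meta_inputs = np.column_stack([cross_val_predict(b, X, y, cv=5) for b in bases])
    meta = LogisticRegression().fit(meta_inputs, y)
    bases = [b.fit(X, y) for b in bases]   # refit the base learners on all the data
    return bases, meta

def stack_predict(bases, meta, X):
    return meta.predict(np.column_stack([b.predict(X) for b in bases]))
```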

14 © Daniel S. Weld 14 Machine Learning Outline Supervised Learning Review Ensembles of Classifiers Bagging Cross-validated committees Boosting Stacking Co-training

15 © Daniel S. Weld 15 Co-Training Motivation. Learning methods need labeled data, lots of <x, f(x)> pairs, and that is hard to get (who wants to label data?). But unlabeled data is usually plentiful... could we use this instead?

16 © Daniel S. Weld 16 Co-training. Suppose we have a little labeled data plus lots of unlabeled data, and each instance has two parts, x = [x1, x2], with x1, x2 conditionally independent given f(x). Each half can be used to classify the instance: ∃ f1, f2 such that f1(x1) ~ f2(x2) ~ f(x). Both f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, and ∃ learning algorithms A1, A2.

17 © Daniel S. Weld 17 Without Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: a few labeled instances [x1, x2] feed A1 and A2, which output f1 and f2; the unlabeled instances are never touched.] Combine f1 and f2 into f' with an ensemble? Bad!! We are not using the unlabeled instances!

18 © Daniel S. Weld 18 Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: a few labeled instances [x1, x2] train A1, whose hypothesis f1 labels the unlabeled instances; the resulting lots of labeled instances train A2, which outputs f2.]

19 © Daniel S. Weld 19 Observations. We can apply A1 to generate as much training data as we want. If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!!! Thus there is no limit to the quality of the hypothesis A2 can make.

20 © Daniel S. Weld 20 Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: the loop now runs in both directions, so each hypothesis labels lots of unlabeled instances for the other learner and f1 and f2 are repeatedly refined.]
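A simplified sketch of this co-training loop, assuming NumPy arrays for the two views and Gaussian naive Bayes learners; the confidence-based selection rule, batch size, and number of rounds are all assumptions:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X1, X2, y, U1, U2, rounds=10, per_round=10):
    # (X1, X2, y): the few labeled instances, split into the two views.
    # (U1, U2): the two views of the unlabeled pool.
    L1, L2, labels = X1.copy(), X2.copy(), y.copy()
    f1 = f2 = None
    for _ in range(rounds):
        f1 = GaussianNB().fit(L1, labels)
        f2 = GaussianNB().fit(L2, labels)
        if len(U1) == 0:
            break
        # Each classifier proposes labels; keep the unlabeled examples it is most confident about.
        conf1 = f1.predict_proba(U1).max(axis=1)
        conf2 = f2.predict_proba(U2).max(axis=1)
        pick = np.unique(np.concatenate([np.argsort(-conf1)[:per_round],
                                         np.argsort(-conf2)[:per_round]]))
        new_y = np.where(conf1[pick] >= conf2[pick],
                         f1.predict(U1[pick]), f2.predict(U2[pick]))
        L1 = np.vstack([L1, U1[pick]])
        L2 = np.vstack([L2, U2[pick]])
        labels = np.concatenate([labels, new_y])
        keep = np.setdiff1d(np.arange(len(U1)), pick)
        U1, U2 = U1[keep], U2[keep]
    return f1, f2
```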

21 © Daniel S. Weld 21 It really works! Learning to classify web pages as course pages: x1 = bag of words on a page, x2 = bag of words from all anchors pointing to a page; Naive Bayes classifiers; 12 labeled pages, 1039 unlabeled. [Plot: percentage error.]

22 Representing Uncertainty

23 © Daniel S. Weld 23 Many Techniques Developed: Fuzzy Logic, Certainty Factors, Non-monotonic logic, Probability. Only one has stood the test of time!

24 © Daniel S. Weld 24 Aspects of Uncertainty. Suppose you have a flight at 12 noon. When should you leave for SEATAC? What are the traffic conditions? How crowded is security? Leaving 18 hours early may get you there, but...?

25 © Daniel S. Weld 25 Decision Theory = Probability + Utility Theory. P(arrive-in-time) by minutes before noon: 20 min: 0.05; 30 min: 0.25; 45 min: 0.50; 60 min: 0.75; 120 min: 0.98; 1080 min: 0.99. Which to choose depends on your preferences. Utility theory: representing and reasoning about preferences.
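A toy expected-utility calculation over the slide's table. The utility of catching the flight and the cost per minute of waiting are invented for illustration:

```python
p_on_time = {20: 0.05, 30: 0.25, 45: 0.50, 60: 0.75, 120: 0.98, 1080: 0.99}

U_ARRIVE = 100       # assumed utility of making the flight
COST_PER_MIN = 0.1   # assumed disutility per minute spent waiting at the airport

def expected_utility(minutes_early):
    p = p_on_time[minutes_early]
    return p * U_ARRIVE - COST_PER_MIN * minutes_early

best = max(p_on_time, key=expected_utility)
print(best, expected_utility(best))   # leaving 120 minutes early wins under these made-up preferences
```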

26 © Daniel S. Weld 26 What Is Probability? Probability: Calculus for dealing with nondeterminism and uncertainty Cf. Logic Probabilistic model: Says how often we expect different things to occur

27 © Daniel S. Weld 27 What Is Statistics? Statistics 1: Describing data Statistics 2: Inferring probabilistic models from data Structure Parameters

28 © Daniel S. Weld 28 Why Should You Care? The world is full of uncertainty Logic is not enough Computers need to be able to handle uncertainty Probability: new foundation for AI (& CS!) Massive amounts of data around today Statistics and CS are both about data Statistics lets us summarize and understand it Statistics is the basis for most learning Statistics lets data do our work for us

29 © Daniel S. Weld 29 Outline Basic notions Atomic events, probabilities, joint distribution Inference by enumeration Independence & conditional independence Bayes’ rule Bayesian Networks Statistical Learning Dynamic Bayesian networks (DBNs) Markov decision processes (MDPs)

30 © Daniel S. Weld 30 Prop. Logic vs. Probability. Symbol Q, R, ... vs. random variable Q, ... Boolean values T, F vs. a domain you specify, e.g. {heads, tails} or [1, 6]. State of the world: an assignment to Q, R, ..., Z. Atomic event: a complete specification of the world, Q ... Z; atomic events are mutually exclusive and exhaustive. Prior (aka unconditional) probability: P(Q). Joint distribution: the probability of every atomic event.

31 Types of Random Variables

32 Axioms of Probability Theory. Just 3 are enough to build the entire theory! 1. All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1. 2. P(true) = 1 and P(false) = 0. 3. The probability of a disjunction of events is P(A ∨ B) = P(A) + P(B) − P(A ∧ B). [Venn diagram: A, B, and A ∧ B inside the rectangle True.]

33 Prior and Joint Probability. We will see later how any question can be answered by the joint distribution. [Prior and joint probability tables omitted.]

34 Conditional (or Posterior) Probability. Conditional or posterior probabilities, e.g. P(cavity | toothache) = 0.8, i.e. given that Toothache is true (and that is all I know). Notation for conditional distributions: P(Cavity | Toothache) is a 2-element vector of 2-element vectors (2 P values when Toothache is true and 2 when it is false). If we know more, e.g. cavity is also given, then we have P(cavity | toothache, cavity) = 1. New evidence may be irrelevant, allowing simplification: P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8.

35 Conditional Probability. P(A | B) is the probability of A given B; it assumes that B is the only information known. Defined as: P(A | B) = P(A ∧ B) / P(B). [Venn diagram: A, B, and A ∧ B inside True.]

36 Dilemma at the Dentist’s What is the probability of a cavity given a toothache? What is the probability of a cavity given the probe catches?

37 © Daniel S. Weld 37 Inference by Enumeration: P(toothache) = .108 + .012 + .016 + .064 = .20, or 20%. This process is called “marginalization”.

38 © Daniel S. Weld 38 Inference by Enumeration: P(toothache ∨ cavity) = .20 + .072 + .008 = .28.
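Both answers fall out of enumerating the full joint. The eight entries below are the standard toothache/catch/cavity table these figures come from; the two entries not quoted in the transcript (.144 and .576) are assumed:

```python
# (toothache, catch, cavity) -> probability
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    # Marginalization: sum the probabilities of the atomic events satisfying the query.
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda toothache, catch, cavity: toothache))            # P(toothache)          = 0.20
print(prob(lambda toothache, catch, cavity: toothache or cavity))  # P(toothache or cavity) = 0.28
```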

39 © Daniel S. Weld 39 Inference by Enumeration

40 Problems with Enumeration. Worst-case time: O(d^n), where d is the maximum arity of the random variables (e.g., d = 2 for Boolean T/F) and n is the number of random variables. Space complexity is also O(d^n), the size of the joint distribution. Problem: it is hard or impossible to estimate all O(d^n) entries for large problems.

41 © Daniel S. Weld 41 Independence. A and B are independent iff P(A | B) = P(A) and P(B | A) = P(B). These two constraints are logically equivalent. Therefore, if A and B are independent: P(A ∧ B) = P(A) P(B).

42 © Daniel S. Weld 42 Independence. [Venn diagram: A, B, and A ∧ B inside True.]

43 © Daniel S. Weld 43 Independence Complete independence is powerful but rare What to do if it doesn’t hold?

44 © Daniel S. Weld 44 Conditional Independence. [Venn diagram: A, B, and A ∧ B inside True.] A and B are not independent, since P(A | B) < P(A).

45 © Daniel S. Weld 45 Conditional Independence. [Venn diagram: adds C, B ∧ C, and A ∧ C.] But A and B are made independent by ¬C: P(A | ¬C) = P(A | B, ¬C).

46 © Daniel S. Weld 46 Conditional Independence Instead of 7 entries, only need 5

47 © Daniel S. Weld 47 Conditional Independence II. P(catch | toothache, cavity) = P(catch | cavity) and P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Why only 5 entries in the table?

48 © Daniel S. Weld 48 Power of Cond. Independence Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!! Conditional independence is the most basic & robust form of knowledge about uncertain environments.

49 Next Up… Bayes’ Rule Bayesian Inference Bayesian Networks Bayes rules!

50 © Daniel S. Weld 50 Bayes Rule: P(H | E) = P(E | H) P(H) / P(E). Simple proof from the definition of conditional probability: (1) P(H | E) = P(H ∧ E) / P(E) (def. cond. prob.); (2) P(E | H) = P(H ∧ E) / P(H) (def. cond. prob.); (3) P(H ∧ E) = P(E | H) P(H) (multiply by P(H) in line 2); substituting #3 into #1 gives P(H | E) = P(E | H) P(H) / P(E). QED.

51 © Daniel S. Weld 51 Use Bayes Rule to Compute Diagnostic Probability from Causal Probability. E.g. let M be meningitis and S be stiff neck, with P(M) = 0.0001, P(S) = 0.1, P(S | M) = 0.8. Then P(M | S) = P(S | M) P(M) / P(S) = 0.8 × 0.0001 / 0.1 = 0.0008.
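A two-line check of that number:

```python
p_m, p_s, p_s_given_m = 0.0001, 0.1, 0.8
p_m_given_s = p_s_given_m * p_m / p_s          # Bayes' rule
print(round(p_m_given_s, 6))                   # 0.0008: a stiff neck is still very weak evidence for meningitis
```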

52 © Daniel S. Weld 52 Bayes’ Rule & Cond. Independence

53 © Daniel S. Weld 53 Bayes Nets. In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation and inference. BNs provide a graphical representation of the conditional independence relations in P; they are usually quite compact, require assessment of fewer parameters (and quite natural ones, e.g. causal), and support (usually) efficient inference: query answering and belief update.

54 © Daniel S. Weld 54 Independence (in the extreme). If X1, X2, ..., Xn are mutually independent, then P(X1, X2, ..., Xn) = P(X1) P(X2) ... P(Xn), and the joint can be specified with n parameters (cf. the usual 2^n − 1 parameters required). While extreme independence is unusual, conditional independence is common; BNs exploit this conditional independence.

55 © Daniel S. Weld 55 An Example Bayes Net. [Network: Burglary and Earthquake are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is the parent of Radio.] Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E, B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

56 © Daniel S. Weld 56 Earthquake Example (con't). If I know Alarm, no other evidence influences my degree of belief in Nbr1Calls: P(N1 | N2, A, E, B) = P(N1 | A); also P(N2 | N1, A, E, B) = P(N2 | A) and P(E | B) = P(E). By the chain rule we have P(N1, N2, A, E, B) = P(N1 | N2, A, E, B) · P(N2 | A, E, B) · P(A | E, B) · P(E | B) · P(B) = P(N1 | A) · P(N2 | A) · P(A | B, E) · P(E) · P(B). The full joint requires only 10 parameters (cf. 32).
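A small check of the parameter count, using the parent sets implied by the factorization above (Radio omitted, as in the slide's count):

```python
parents = {"B": [], "E": [], "A": ["B", "E"], "N1": ["A"], "N2": ["A"]}
params = sum(2 ** len(p) for p in parents.values())   # one P(X=t | parent values) entry per CPT row
print(params, "parameters vs.", 2 ** len(parents), "joint entries")   # 10 vs. 32
```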

57 © Daniel S. Weld 57 BNs: Qualitative Structure. The graphical structure of a BN reflects conditional independence among variables. Each variable X is a node in the DAG; edges denote direct probabilistic influence, usually interpreted causally; the parents of X are denoted Par(X). X is conditionally independent of all non-descendants given its parents. A graphical test exists for more general independence (the “Markov Blanket”).

58 © Daniel S. Weld 58 Given Parents, X is Independent of Non-Descendants

59 © Daniel S. Weld 59 For Example. [Network: Earthquake, Burglary, Alarm, Nbr1Calls, Nbr2Calls, Radio.]

60 © Daniel S. Weld 60 Given its Markov Blanket, X is Independent of All Other Nodes. MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))

61 © Daniel S. Weld 61 Conditional Probability Tables. [Same network: Earthquake, Burglary, Alarm, Nbr1Calls, Nbr2Calls, Radio.] Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E, B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

62 © Daniel S. Weld 62 Conditional Probability Tables. For a complete specification of the joint distribution, quantify the BN: for each variable X, specify the CPT P(X | Par(X)); the number of parameters is locally exponential in |Par(X)|. If X1, X2, ..., Xn is any topological sort of the network, then we are assured: P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) ... P(X2 | X1) · P(X1) = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) ... P(X1).
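A sketch of reading a joint entry off the CPTs in topological order for the alarm network. The P(A | E, B) rows follow the earlier slide (their row-to-value mapping is inferred); P(E), P(N1 | A), and P(N2 | A) are not spelled out in the transcript, so the numbers used for them here are assumptions:

```python
def p_b(b):
    return 0.05 if b else 0.95

def p_e(e):
    return 0.02 if e else 0.98                         # assumed prior, not in the transcript

P_A_ROWS = {(True, True): 0.9, (True, False): 0.2,     # P(a | E, B), keyed by (E, B); mapping inferred
            (False, True): 0.85, (False, False): 0.01}

def p_a(a, e, b):
    p = P_A_ROWS[(e, b)]
    return p if a else 1 - p

def p_n(n, a):
    p = 0.9 if a else 0.05                             # assumed P(NbrCalls | A), same for both neighbors
    return p if n else 1 - p

def joint(n1, n2, a, e, b):
    # P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|B,E) P(E) P(B), reading each CPT along a topological order
    return p_n(n1, a) * p_n(n2, a) * p_a(a, e, b) * p_e(e) * p_b(b)

print(joint(True, True, True, False, True))            # e.g. P(n1, n2, a, not-e, b)
```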

63 © Daniel S. Weld 63 Inference in BNs The graphical independence representation yields efficient inference schemes We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence Computations organized by network topology One simple algorithm: variable elimination (VE)

64 © Daniel S. Weld 64 P(B | J=true, M=true). [Network: Earthquake, Burglary, Alarm, John, Mary, Radio.] P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)

65 © Daniel S. Weld 65 Structure of Computation Dynamic Programming

66 © Daniel S. Weld 66 Variable Elimination. A factor is a function from some set of variables to a specific value, e.g. f(E, A, N1); CPTs are factors, e.g. P(A | E, B) is a function of A, E, B. VE works by eliminating all variables in turn until there is a factor with only the query variable. To eliminate a variable: join all factors containing that variable (like a DB join), then sum out the influence of the variable on the new factor. This exploits the product form of the joint distribution.

67 © Daniel S. Weld 67 Example of VE: P(N1). [Network: Earthquake, Burglary, Alarm, N1, N2.] P(N1) = Σ_{N2,A,B,E} P(N1, N2, A, B, E) = Σ_{N2,A,B,E} P(N1|A) P(N2|A) P(B) P(A|B,E) P(E) = Σ_A P(N1|A) Σ_{N2} P(N2|A) Σ_B P(B) Σ_E P(A|B,E) P(E) = Σ_A P(N1|A) Σ_{N2} P(N2|A) Σ_B P(B) f1(A,B) = Σ_A P(N1|A) Σ_{N2} P(N2|A) f2(A) = Σ_A P(N1|A) f3(A) = f4(N1)
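A compact sketch that follows this elimination order numerically. The P(A | E, B) values come from the earlier CPT slide (row mapping inferred); P(E) and P(NbrCalls | A) are assumed, since the transcript does not give them:

```python
TF = (True, False)
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.02, False: 0.98}                        # assumed prior
P_A_ROWS = {(True, True): 0.9, (True, False): 0.2,     # P(a | E, B), keyed by (E, B); mapping inferred
            (False, True): 0.85, (False, False): 0.01}
P_N_ROWS = {True: 0.9, False: 0.05}                    # assumed P(NbrCalls = t | A)

def p_a(a, e, b):
    p = P_A_ROWS[(e, b)]
    return p if a else 1 - p

def p_n(n, a):
    p = P_N_ROWS[a]
    return p if n else 1 - p

# f1(A,B) = sum_E P(A|B,E) P(E)
f1 = {(a, b): sum(p_a(a, e, b) * P_E[e] for e in TF) for a in TF for b in TF}
# f2(A) = sum_B P(B) f1(A,B)
f2 = {a: sum(P_B[b] * f1[(a, b)] for b in TF) for a in TF}
# f3(A) = (sum_N2 P(N2|A)) f2(A); the inner sum is 1, so summing out N2 just drops it
f3 = {a: sum(p_n(n2, a) for n2 in TF) * f2[a] for a in TF}
# f4(N1) = sum_A P(N1|A) f3(A)
f4 = {n1: sum(p_n(n1, a) * f3[a] for a in TF) for n1 in TF}
print(f4)   # {True: P(N1=t), False: P(N1=f)}; the two values sum to 1
```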

68 © Daniel S. Weld 68 Notes on VE. Each operation is simply a multiplication of factors followed by summing out a variable. Complexity is determined by the size of the largest factor (e.g., 3 variables, not 5, in the example): linear in the number of variables, exponential in the size of the largest factor. The elimination ordering greatly impacts factor size; finding an optimal elimination ordering is NP-hard, so we rely on heuristics and special structure (e.g., polytrees). Practically, inference is much more tractable using structure of this sort.

