
1 Machine Learning II Ensembles & Cotraining CSE 573 Representing Uncertainty

2 © Daniel S. Weld 2 Logistics Reading Ch 13 Ch 14 thru 14.3

3 © Daniel S. Weld 3 DT Learning as Search. Nodes: decision trees; Operators: tree refinement (sprouting the tree); Initial node: the smallest tree possible (a single leaf); Heuristic: information gain; Goal: the best tree possible (???); Type of search: hill climbing.

4 Bias Restriction Preference

5 © Daniel S. Weld 5 Decision Tree Representation. Good day for tennis? [Tree diagram: Outlook branches to Sunny, Overcast, Rain; Sunny tests Humidity (High: No, Normal: Yes); Overcast: Yes; Rain tests Wind (Strong: No, Weak: Yes).] Leaves = classification; arcs = choice of value for the parent attribute. A decision tree is equivalent to logic in disjunctive normal form: G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)

6 © Daniel S. Weld 6 Overfitting. [Plot: accuracy (y-axis, 0.6 to 0.9) vs. model complexity (e.g. number of nodes in the decision tree); one curve for accuracy on training data, one for accuracy on test data.]

7 © Daniel S. Weld 7 Machine Learning Outline Supervised Learning Review Ensembles of Classifiers Bagging Cross-validated committees Boosting Stacking Co-training

8 © Daniel S. Weld 8 Voting

9 © Daniel S. Weld 9 Ensembles of Classifiers. Assume errors are independent (suppose each classifier has 30% error) and take a majority vote over an ensemble of 21 classifiers. What is the probability that the majority is wrong? The number of classifiers in error follows a binomial distribution; with individual error 0.3, the area under the curve for ≥ 11 wrong is 0.026, an order of magnitude improvement! [Plot: binomial distribution over the number of classifiers in error.]
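A quick sanity check of this arithmetic (not from the slides): with 21 classifiers that each err independently with probability 0.3, the majority vote is wrong only when 11 or more err at once.

```python
from math import comb

n, p = 21, 0.3
# Tail of the binomial distribution: P(11 or more of the 21 classifiers are wrong)
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
print(f"P(majority of {n} classifiers wrong) = {p_majority_wrong:.3f}")  # ~0.026
```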

10 © Daniel S. Weld 10 Constructing Ensembles: cross-validated committees (holdout). Partition the examples into k disjoint equivalence classes, then create k training sets: each set is the union of all equivalence classes except one, so each set has (k-1)/k of the original training data. Now train a classifier on each set.
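A minimal sketch of cross-validated committees, assuming NumPy arrays, integer class labels, and a scikit-learn-style base learner (the decision tree here is just a placeholder):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cv_committee(X, y, k=5, seed=0):
    # Partition the examples into k disjoint folds; member i trains on all folds but fold i.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    members = []
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        members.append(DecisionTreeClassifier().fit(X[train_idx], y[train_idx]))
    return members

def committee_vote(members, X):
    # Majority vote over the members' predictions (assumes integer class labels).
    preds = np.stack([m.predict(X) for m in members]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```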

11 © Daniel S. Weld 11 Ensemble Construction II: bagging. Generate k sets of training examples: for each set, draw m examples randomly (with replacement) from the original set of m examples, so each training set covers about 63.2% of the original examples (plus duplicates). Now train a classifier on each set.
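A minimal bagging sketch under the same assumptions (NumPy arrays and a scikit-learn-style base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, k=21, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    members = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # m draws with replacement
        # On average, len(np.unique(idx)) / m is about 1 - 1/e = 0.632
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return members
```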

12 © Daniel S. Weld 12 Ensemble Creation III: boosting ("bagging with optimized choice of examples"). Maintain a probability distribution over the set of training examples and create k training sets iteratively. On iteration i: draw m examples randomly (as in bagging), but use the probability distribution to bias selection; train classifier number i on this training set; test the partial ensemble (of i classifiers) on all training examples; and modify the distribution to increase the probability of each erroneous example. This creates harder and harder learning problems...
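A rough sketch of the boosting loop described above. The specific reweighting rule (doubling the weight of misclassified examples) and the use of the newest classifier, rather than the partial ensemble, for the error test are simplifying assumptions, not the slide's prescription:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    weights = np.full(m, 1.0 / m)               # probability distribution over training examples
    members = []
    for _ in range(k):
        idx = rng.choice(m, size=m, p=weights)  # draw m examples, biased by the distribution
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        members.append(clf)
        wrong = clf.predict(X) != y             # evaluate on all training examples
        weights[wrong] *= 2.0                   # assumed rule: make misclassified examples more likely
        weights /= weights.sum()
    return members
```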

13 © Daniel S. Weld 13 Ensemble Creation IV: stacking. Train several base learners, then train a meta-learner that learns when the base learners are right or wrong; the meta-learner then arbitrates. Train using cross-validated committees: the meta-learner's inputs are the base-learner predictions, and its training examples are the 'test set' from cross validation.
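A stacking sketch in which out-of-fold predictions serve as the meta-learner's training set; the particular base and meta learners (scikit-learn's decision tree, naive Bayes, and logistic regression) are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def train_stack(X, y):
    bases = [DecisionTreeClassifier(), GaussianNB()]
    # Out-of-fold predictions play the role of the 'test set' from cross validation.
    meta_inputs = np.column_stack([cross_val_predict(b, X, y, cv=5) for b in bases])
    meta = LogisticRegression().fit(meta_inputs, y)
    bases = [b.fit(X, y) for b in bases]   # refit the base learners on all the data
    return bases, meta

def stack_predict(bases, meta, X):
    return meta.predict(np.column_stack([b.predict(X) for b in bases]))
```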

14 © Daniel S. Weld 14 Machine Learning Outline Supervised Learning Review Ensembles of Classifiers Bagging Cross-validated committees Boosting Stacking Co-training

15 © Daniel S. Weld 15 Co-Training Motivation. Learning methods need labeled data, lots of <x, f(x)> pairs, and that is hard to get (who wants to label data?). But unlabeled data is usually plentiful... could we use this instead?

16 © Daniel S. Weld 16 Co-training. Suppose we have a little labeled data plus lots of unlabeled data, and each instance has two parts, x = [x1, x2], with x1, x2 conditionally independent given f(x). Each half can be used to classify the instance: ∃ f1, f2 such that f1(x1) ~ f2(x2) ~ f(x). Both f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, and ∃ learning algorithms A1, A2.

17 © Daniel S. Weld 17 Without Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: a few labeled instances [x1, x2] feed A1 and A2, which output f1 and f2; the unlabeled instances are never touched.] Combine f1 and f2 into f' with an ensemble? Bad!! We are not using the unlabeled instances!

18 © Daniel S. Weld 18 Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: a few labeled instances [x1, x2] train A1, whose hypothesis f1 labels the unlabeled instances; the resulting lots of labeled instances train A2, which outputs f2.]

19 © Daniel S. Weld 19 Observations. We can apply A1 to generate as much training data as we want. If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!!! Thus there is no limit to the quality of the hypothesis A2 can make.

20 © Daniel S. Weld 20 Co-training: f1(x1) ~ f2(x2) ~ f(x); A1 learns f1 from x1, A2 learns f2 from x2. [Diagram: the loop now runs in both directions, so each hypothesis labels lots of unlabeled instances for the other learner and f1 and f2 are repeatedly refined.]
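A simplified sketch of this co-training loop, assuming NumPy arrays for the two views and Gaussian naive Bayes learners; the confidence-based selection rule, batch size, and number of rounds are all assumptions:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X1, X2, y, U1, U2, rounds=10, per_round=10):
    # (X1, X2, y): the few labeled instances, split into the two views.
    # (U1, U2): the two views of the unlabeled pool.
    L1, L2, labels = X1.copy(), X2.copy(), y.copy()
    f1 = f2 = None
    for _ in range(rounds):
        f1 = GaussianNB().fit(L1, labels)
        f2 = GaussianNB().fit(L2, labels)
        if len(U1) == 0:
            break
        # Each classifier proposes labels; keep the unlabeled examples it is most confident about.
        conf1 = f1.predict_proba(U1).max(axis=1)
        conf2 = f2.predict_proba(U2).max(axis=1)
        pick = np.unique(np.concatenate([np.argsort(-conf1)[:per_round],
                                         np.argsort(-conf2)[:per_round]]))
        new_y = np.where(conf1[pick] >= conf2[pick],
                         f1.predict(U1[pick]), f2.predict(U2[pick]))
        L1 = np.vstack([L1, U1[pick]])
        L2 = np.vstack([L2, U2[pick]])
        labels = np.concatenate([labels, new_y])
        keep = np.setdiff1d(np.arange(len(U1)), pick)
        U1, U2 = U1[keep], U2[keep]
    return f1, f2
```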

21 © Daniel S. Weld 21 It really works! Learning to classify web pages as course pages: x1 = bag of words on a page, x2 = bag of words from all anchors pointing to a page; Naive Bayes classifiers; 12 labeled pages, 1039 unlabeled. [Plot: percentage error.]

22 Representing Uncertainty

23 © Daniel S. Weld 23 Many Techniques Developed: Fuzzy Logic, Certainty Factors, Non-monotonic logic, Probability. Only one has stood the test of time!

24 © Daniel S. Weld 24 Aspects of Uncertainty. Suppose you have a flight at 12 noon. When should you leave for SEATAC? What are the traffic conditions? How crowded is security? Leaving 18 hours early may get you there, but...?

25 © Daniel S. Weld 25 Decision Theory = Probability + Utility Theory. P(arrive-in-time) by minutes before noon: 20 min: 0.05; 30 min: 0.25; 45 min: 0.50; 60 min: 0.75; 120 min: 0.98; 1080 min: 0.99. Which to choose depends on your preferences. Utility theory: representing and reasoning about preferences.
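A toy expected-utility calculation over the slide's table. The utility of catching the flight and the cost per minute of waiting are invented for illustration:

```python
p_on_time = {20: 0.05, 30: 0.25, 45: 0.50, 60: 0.75, 120: 0.98, 1080: 0.99}

U_ARRIVE = 100       # assumed utility of making the flight
COST_PER_MIN = 0.1   # assumed disutility per minute spent waiting at the airport

def expected_utility(minutes_early):
    p = p_on_time[minutes_early]
    return p * U_ARRIVE - COST_PER_MIN * minutes_early

best = max(p_on_time, key=expected_utility)
print(best, expected_utility(best))   # leaving 120 minutes early wins under these made-up preferences
```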

26 © Daniel S. Weld 26 What Is Probability? Probability: Calculus for dealing with nondeterminism and uncertainty Cf. Logic Probabilistic model: Says how often we expect different things to occur

27 © Daniel S. Weld 27 What Is Statistics? Statistics 1: Describing data Statistics 2: Inferring probabilistic models from data Structure Parameters

28 © Daniel S. Weld 28 Why Should You Care? The world is full of uncertainty Logic is not enough Computers need to be able to handle uncertainty Probability: new foundation for AI (& CS!) Massive amounts of data around today Statistics and CS are both about data Statistics lets us summarize and understand it Statistics is the basis for most learning Statistics lets data do our work for us

29 © Daniel S. Weld 29 Outline Basic notions Atomic events, probabilities, joint distribution Inference by enumeration Independence & conditional independence Bayes’ rule Bayesian Networks Statistical Learning Dynamic Bayesian networks (DBNs) Markov decision processes (MDPs)

30 © Daniel S. Weld 30 Prop. Logic vs. Probability. Symbol Q, R, ... vs. random variable Q, ... Boolean values T, F vs. a domain you specify, e.g. {heads, tails} or [1, 6]. State of the world: an assignment to Q, R, ..., Z. Atomic event: a complete specification of the world, Q ... Z; atomic events are mutually exclusive and exhaustive. Prior (aka unconditional) probability: P(Q). Joint distribution: the probability of every atomic event.

31 Types of Random Variables

32 Axioms of Probability Theory. Just 3 are enough to build the entire theory! 1. All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1. 2. P(true) = 1 and P(false) = 0. 3. The probability of a disjunction of events is P(A ∨ B) = P(A) + P(B) − P(A ∧ B). [Venn diagram: A, B, and A ∧ B inside the rectangle True.]

33 Prior and Joint Probability. We will see later how any question can be answered by the joint distribution. [Prior and joint probability tables omitted.]

34 Conditional (or Posterior) Probability. Conditional or posterior probabilities, e.g. P(cavity | toothache) = 0.8, i.e. given that Toothache is true (and that is all I know). Notation for conditional distributions: P(Cavity | Toothache) is a 2-element vector of 2-element vectors (2 P values when Toothache is true and 2 when it is false). If we know more, e.g. cavity is also given, then we have P(cavity | toothache, cavity) = 1. New evidence may be irrelevant, allowing simplification: P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8.

35 Conditional Probability. P(A | B) is the probability of A given B; it assumes that B is the only information known. Defined as: P(A | B) = P(A ∧ B) / P(B). [Venn diagram: A, B, and A ∧ B inside True.]

36 Dilemma at the Dentist’s What is the probability of a cavity given a toothache? What is the probability of a cavity given the probe catches?

37 © Daniel S. Weld 37 Inference by Enumeration: P(toothache) = .108 + .012 + .016 + .064 = .20, or 20%. This process is called “marginalization”.

38 © Daniel S. Weld 38 Inference by Enumeration: P(toothache ∨ cavity) = .20 + .072 + .008 = .28.
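Both answers fall out of enumerating the full joint. The eight entries below are the standard toothache/catch/cavity table these figures come from; the two entries not quoted in the transcript (.144 and .576) are assumed:

```python
# (toothache, catch, cavity) -> probability
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    # Marginalization: sum the probabilities of the atomic events satisfying the query.
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda toothache, catch, cavity: toothache))            # P(toothache)          = 0.20
print(prob(lambda toothache, catch, cavity: toothache or cavity))  # P(toothache or cavity) = 0.28
```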

39 © Daniel S. Weld 39 Inference by Enumeration

40 Problems with Enumeration. Worst-case time: O(d^n), where d is the maximum arity of the random variables (e.g., d = 2 for Boolean T/F) and n is the number of random variables. Space complexity is also O(d^n), the size of the joint distribution. Problem: it is hard or impossible to estimate all O(d^n) entries for large problems.

41 © Daniel S. Weld 41 Independence. A and B are independent iff P(A | B) = P(A) and P(B | A) = P(B). These two constraints are logically equivalent. Therefore, if A and B are independent: P(A ∧ B) = P(A) P(B).

42 © Daniel S. Weld 42 Independence. [Venn diagram: A, B, and A ∧ B inside True.]

43 © Daniel S. Weld 43 Independence Complete independence is powerful but rare What to do if it doesn’t hold?

44 © Daniel S. Weld 44 Conditional Independence. [Venn diagram: A, B, and A ∧ B inside True.] A and B are not independent, since P(A | B) < P(A).

45 © Daniel S. Weld 45 Conditional Independence. [Venn diagram: adds C, B ∧ C, and A ∧ C.] But A and B are made independent by ¬C: P(A | ¬C) = P(A | B, ¬C).

46 © Daniel S. Weld 46 Conditional Independence Instead of 7 entries, only need 5

47 © Daniel S. Weld 47 Conditional Independence II. P(catch | toothache, cavity) = P(catch | cavity) and P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Why only 5 entries in the table?

48 © Daniel S. Weld 48 Power of Cond. Independence Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!! Conditional independence is the most basic & robust form of knowledge about uncertain environments.

49 Next Up… Bayes’ Rule Bayesian Inference Bayesian Networks Bayes rules!

50 © Daniel S. Weld 50 Bayes Rule: P(H | E) = P(E | H) P(H) / P(E). Simple proof from the definition of conditional probability: (1) P(H | E) = P(H ∧ E) / P(E) (def. cond. prob.); (2) P(E | H) = P(H ∧ E) / P(H) (def. cond. prob.); (3) P(H ∧ E) = P(E | H) P(H) (multiply by P(H) in line 2); substituting #3 into #1 gives P(H | E) = P(E | H) P(H) / P(E). QED.

51 © Daniel S. Weld 51 Use Bayes Rule to Compute Diagnostic Probability from Causal Probability. E.g. let M be meningitis and S be stiff neck, with P(M) = 0.0001, P(S) = 0.1, P(S | M) = 0.8. Then P(M | S) = P(S | M) P(M) / P(S) = 0.8 × 0.0001 / 0.1 = 0.0008.
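A two-line check of that number:

```python
p_m, p_s, p_s_given_m = 0.0001, 0.1, 0.8
p_m_given_s = p_s_given_m * p_m / p_s          # Bayes' rule
print(round(p_m_given_s, 6))                   # 0.0008: a stiff neck is still very weak evidence for meningitis
```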

52 © Daniel S. Weld 52 Bayes’ Rule & Cond. Independence

53 © Daniel S. Weld 53 Bayes Nets. In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation and inference. BNs provide a graphical representation of the conditional independence relations in P; they are usually quite compact, require assessment of fewer parameters (and quite natural ones, e.g. causal), and support (usually) efficient inference: query answering and belief update.

54 © Daniel S. Weld 54 Independence (in the extreme). If X1, X2, ..., Xn are mutually independent, then P(X1, X2, ..., Xn) = P(X1) P(X2) ... P(Xn), and the joint can be specified with n parameters (cf. the usual 2^n − 1 parameters required). While extreme independence is unusual, conditional independence is common; BNs exploit this conditional independence.

55 © Daniel S. Weld 55 An Example Bayes Net. [Network: Burglary and Earthquake are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Earthquake is the parent of Radio.] Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E, B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

56 © Daniel S. Weld 56 Earthquake Example (con't). If I know Alarm, no other evidence influences my degree of belief in Nbr1Calls: P(N1 | N2, A, E, B) = P(N1 | A); also P(N2 | N1, A, E, B) = P(N2 | A) and P(E | B) = P(E). By the chain rule we have P(N1, N2, A, E, B) = P(N1 | N2, A, E, B) · P(N2 | A, E, B) · P(A | E, B) · P(E | B) · P(B) = P(N1 | A) · P(N2 | A) · P(A | B, E) · P(E) · P(B). The full joint requires only 10 parameters (cf. 32).
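A small check of the parameter count, using the parent sets implied by the factorization above (Radio omitted, as in the slide's count):

```python
parents = {"B": [], "E": [], "A": ["B", "E"], "N1": ["A"], "N2": ["A"]}
params = sum(2 ** len(p) for p in parents.values())   # one P(X=t | parent values) entry per CPT row
print(params, "parameters vs.", 2 ** len(parents), "joint entries")   # 10 vs. 32
```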

57 © Daniel S. Weld 57 BNs: Qualitative Structure. The graphical structure of a BN reflects conditional independence among variables. Each variable X is a node in the DAG; edges denote direct probabilistic influence, usually interpreted causally; the parents of X are denoted Par(X). X is conditionally independent of all non-descendants given its parents. A graphical test exists for more general independence (the “Markov Blanket”).

58 © Daniel S. Weld 58 Given Parents, X is Independent of Non-Descendants

59 © Daniel S. Weld 59 For Example. [Network: Earthquake, Burglary, Alarm, Nbr1Calls, Nbr2Calls, Radio.]

60 © Daniel S. Weld 60 Given its Markov Blanket, X is Independent of All Other Nodes. MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))

61 © Daniel S. Weld 61 Conditional Probability Tables. [Same network: Earthquake, Burglary, Alarm, Nbr1Calls, Nbr2Calls, Radio.] Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E, B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

62 © Daniel S. Weld 62 Conditional Probability Tables. For a complete specification of the joint distribution, quantify the BN: for each variable X, specify the CPT P(X | Par(X)); the number of parameters is locally exponential in |Par(X)|. If X1, X2, ..., Xn is any topological sort of the network, then we are assured: P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) ... P(X2 | X1) · P(X1) = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) ... P(X1).
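A sketch of reading a joint entry off the CPTs in topological order for the alarm network. The P(A | E, B) rows follow the earlier slide (their row-to-value mapping is inferred); P(E), P(N1 | A), and P(N2 | A) are not spelled out in the transcript, so the numbers used for them here are assumptions:

```python
def p_b(b):
    return 0.05 if b else 0.95

def p_e(e):
    return 0.02 if e else 0.98                         # assumed prior, not in the transcript

P_A_ROWS = {(True, True): 0.9, (True, False): 0.2,     # P(a | E, B), keyed by (E, B); mapping inferred
            (False, True): 0.85, (False, False): 0.01}

def p_a(a, e, b):
    p = P_A_ROWS[(e, b)]
    return p if a else 1 - p

def p_n(n, a):
    p = 0.9 if a else 0.05                             # assumed P(NbrCalls | A), same for both neighbors
    return p if n else 1 - p

def joint(n1, n2, a, e, b):
    # P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|B,E) P(E) P(B), reading each CPT along a topological order
    return p_n(n1, a) * p_n(n2, a) * p_a(a, e, b) * p_e(e) * p_b(b)

print(joint(True, True, True, False, True))            # e.g. P(n1, n2, a, not-e, b)
```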

63 © Daniel S. Weld 63 Inference in BNs The graphical independence representation yields efficient inference schemes We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence Computations organized by network topology One simple algorithm: variable elimination (VE)

64 © Daniel S. Weld 64 P(B | J=true, M=true). [Network: Earthquake, Burglary, Alarm, John, Mary, Radio.] P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)

65 © Daniel S. Weld 65 Structure of Computation Dynamic Programming

66 © Daniel S. Weld 66 Variable Elimination. A factor is a function from some set of variables to a specific value, e.g. f(E, A, N1); CPTs are factors, e.g. P(A | E, B) is a function of A, E, B. VE works by eliminating all variables in turn until there is a factor with only the query variable. To eliminate a variable: join all factors containing that variable (like a DB join), then sum out the influence of the variable on the new factor. This exploits the product form of the joint distribution.

67 © Daniel S. Weld 67 Example of VE: P(N1). [Network: Earthquake, Burglary, Alarm, N1, N2.] P(N1) = Σ_{N2,A,B,E} P(N1, N2, A, B, E) = Σ_{N2,A,B,E} P(N1|A) P(N2|A) P(B) P(A|B,E) P(E) = Σ_A P(N1|A) Σ_{N2} P(N2|A) Σ_B P(B) Σ_E P(A|B,E) P(E) = Σ_A P(N1|A) Σ_{N2} P(N2|A) Σ_B P(B) f1(A,B) = Σ_A P(N1|A) Σ_{N2} P(N2|A) f2(A) = Σ_A P(N1|A) f3(A) = f4(N1)
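A compact sketch that follows this elimination order numerically. The P(A | E, B) values come from the earlier CPT slide (row mapping inferred); P(E) and P(NbrCalls | A) are assumed, since the transcript does not give them:

```python
TF = (True, False)
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.02, False: 0.98}                        # assumed prior
P_A_ROWS = {(True, True): 0.9, (True, False): 0.2,     # P(a | E, B), keyed by (E, B); mapping inferred
            (False, True): 0.85, (False, False): 0.01}
P_N_ROWS = {True: 0.9, False: 0.05}                    # assumed P(NbrCalls = t | A)

def p_a(a, e, b):
    p = P_A_ROWS[(e, b)]
    return p if a else 1 - p

def p_n(n, a):
    p = P_N_ROWS[a]
    return p if n else 1 - p

# f1(A,B) = sum_E P(A|B,E) P(E)
f1 = {(a, b): sum(p_a(a, e, b) * P_E[e] for e in TF) for a in TF for b in TF}
# f2(A) = sum_B P(B) f1(A,B)
f2 = {a: sum(P_B[b] * f1[(a, b)] for b in TF) for a in TF}
# f3(A) = (sum_N2 P(N2|A)) f2(A); the inner sum is 1, so summing out N2 just drops it
f3 = {a: sum(p_n(n2, a) for n2 in TF) * f2[a] for a in TF}
# f4(N1) = sum_A P(N1|A) f3(A)
f4 = {n1: sum(p_n(n1, a) * f3[a] for a in TF) for n1 in TF}
print(f4)   # {True: P(N1=t), False: P(N1=f)}; the two values sum to 1
```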

68 © Daniel S. Weld 68 Notes on VE. Each operation is simply a multiplication of factors followed by summing out a variable. Complexity is determined by the size of the largest factor (e.g., 3 variables, not 5, in the example): linear in the number of variables, exponential in the size of the largest factor. The elimination ordering greatly impacts factor size; finding an optimal elimination ordering is NP-hard, so we rely on heuristics and special structure (e.g., polytrees). Practically, inference is much more tractable using structure of this sort.

