Today’s Topics (only on final at a high level; Sec 19.5 and Sec 18.5 readings below are ‘skim only’) 12/8/15, CS 540 - Fall 2015 (Shavlik©), Lecture 30, Week 14

Today’s Topics (only on final at a high level; the Sec 19.5 and Sec 18.5 readings below are ‘skim only’)

HW5 must be turned in by 11:55pm Fri (solution out early Sat)
Read Chapters 26 and 27 of the textbook for next Tuesday
Exam (comprehensive, with focus on material since the midterm): Thurs 5:30-7:30pm, in this room; two pages of notes and a simple calculator (log, e, * / + -) allowed
Next Tues we’ll cover my Fall 2014 final (Spring 2013 final next Weds?)
A Short Introduction to Inductive Logic Programming (ILP) – Sec 19.5 of textbook
 - learning FOPC ‘rule sets’
 - could, in a follow-up step, learn MLN weights on these rules (ie, learn ‘structure’ then learn ‘weights’)
A Short Introduction to Computational Learning Theory (COLT) – Sec 18.5 of text

Inductive Logic Programming (ILP)

Use mathematical logic to
 – Represent training examples (goes beyond fixed-length feature vectors)
 – Represent learned models (FOPC rule sets)
ML work in the late ’70s through early ’90s was logic-based; then statistical ML ‘took over’

Examples in FOPC (not all examples have the same # of ‘features’)

PosEx1: on(ex1, block1, table) ∧ on(ex1, block2, block1) ∧ color(ex1, block1, blue) ∧ color(ex1, block2, blue) ∧ size(ex1, block1, large) ∧ size(ex1, block2, small)
PosEx2: (a second positive example, shown pictorially on the slide)

Learned Concept: tower(?E) if on(?E, ?A, table), on(?E, ?B, ?A).
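
To make the representation concrete, here is a minimal Python sketch (not from the lecture) of how a variable-length relational example such as PosEx1 and the learned tower rule could be encoded; the fact tuples and the tower_holds helper are illustrative assumptions, not an actual ILP system.

# A relational example is just a set of ground facts; unlike a fixed-length
# feature vector, different examples can have different numbers of facts.
pos_ex1 = {
    ("on", "ex1", "block1", "table"),
    ("on", "ex1", "block2", "block1"),
    ("color", "ex1", "block1", "blue"),
    ("color", "ex1", "block2", "blue"),
    ("size", "ex1", "block1", "large"),
    ("size", "ex1", "block2", "small"),
}

def tower_holds(ex_id, facts):
    """Check the learned rule: tower(?E) if on(?E, ?A, table), on(?E, ?B, ?A)."""
    on_facts = [f for f in facts if f[0] == "on"]
    for (_, e1, a, base) in on_facts:          # on(?E, ?A, table)
        if e1 == ex_id and base == "table":
            for (_, e2, _b, a2) in on_facts:   # on(?E, ?B, ?A)
                if e2 == ex_id and a2 == a:
                    return True
    return False

print(tower_holds("ex1", pos_ex1))  # True: block2 is on block1, which is on the table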

Searching for a Good Rule (propositional-logic version)

Search tree from the slide (the most general rule is at the root; each child adds one condition):
 P is always true
  P if A
  P if B
   P if B and C
   P if B and D
  P if C

All Possible Extensions of a Clause (capital letters are variables)

Assume we are expanding this node: q(X, Z) ← p(X, Y)
What are the possible extensions using r/3?
 r(X,X,X)  r(Y,Y,Y)  r(Z,Z,Z)  r(1,1,1)
 r(X,Y,Z)  r(Z,Y,X)  r(X,X,Y)  r(X,X,1)
 r(X,Y,A)  r(X,A,B)  r(A,A,A)  r(A,B,1)
 … and many more
Choose arguments from: old variables, constants, new variables
Huge branching factor in our search!
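
As a rough illustration of where the branching factor comes from, here is a hedged Python sketch (not the lecture’s code) that enumerates candidate literals for one predicate by filling each argument with an old variable, a constant, or a fresh variable; candidate_literals and its argument names are assumptions made up for this sketch.

from itertools import product

def candidate_literals(pred, arity, old_vars, constants, n_new_vars=2):
    """Crudely enumerate ways to fill the arguments of `pred` from old
    variables, constants, or fresh variables (ignores equivalent renamings)."""
    new_vars = [f"New{i}" for i in range(n_new_vars)]
    choices = old_vars + constants + new_vars
    return [f"{pred}({','.join(args)})" for args in product(choices, repeat=arity)]

# Expanding q(X, Z) <- p(X, Y) with the 3-ary predicate r/3:
cands = candidate_literals("r", 3, old_vars=["X", "Y", "Z"], constants=["1"])
print(len(cands))   # 6**3 = 216 candidates from a single predicate
print(cands[:4])    # ['r(X,X,X)', 'r(X,X,Y)', 'r(X,X,Z)', 'r(X,X,1)']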

Example: ILP in the Blocks World

Consider this training set (the slide shows pictures of block configurations labeled POS and NEG)
Can you guess an FOPC rule?

Searching for a Good Rule (FOPC version; capital letters are variables)

Assume we have the predicates: tall(X), wide(Y), square(X), on(X,Y), red(X), green(X), blue(X), block(X), …
Search tree from the slide (start from the most general rule):
 true → POS
  on(X,Y) → POS
  blue(X) → POS
  tall(X) → POS
POSSIBLE RULE LEARNED: if on(X,Y) ∧ block(Y) ∧ blue(X) then POS
 - hard to learn with fixed-length feature vectors!

Covering Algorithms (learn a rule, then recur; so the learned model is disjunctive)

Picture on the slide: the examples covered by Rule 1 are set aside; the examples still to cover are used to learn Rule 2, and so on.
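
A minimal sketch of this covering loop, assuming some learn_one_rule procedure (such as the general-to-specific search on the earlier slides) and a covers test are available; both names are hypothetical placeholders, not code from the lecture.

def sequential_covering(pos, neg, learn_one_rule, covers, min_pos_covered=1):
    """Learn a disjunctive rule set: learn one rule, remove the positive
    examples it covers, then recur on the examples still to cover."""
    rules, remaining_pos = [], list(pos)
    while remaining_pos:
        rule = learn_one_rule(remaining_pos, neg)       # e.g., greedy general-to-specific search
        covered = [ex for ex in remaining_pos if covers(rule, ex)]
        if len(covered) < min_pos_covered:              # no useful rule found; give up
            break
        rules.append(rule)
        remaining_pos = [ex for ex in remaining_pos if not covers(rule, ex)]
    return rules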

Using Background Knowledge (BK) in ILP

Now consider adding some domain knowledge about the task being learned
For example:
 If Q, R, and W are all true
 Then you can infer Z is true
Can also do arithmetic, etc. in BK rule bodies:
 If SOME_TRIG_CALCS_OUTSIDE_OF_LOGIC
 Then openPassingLane(P1, P2, Radius, Angle)

Searching for a Good Rule using Deduced Features (eg, Z)

Search tree from the slide (Z is a feature deduced via the BK):
 P is always true
  P if A
  P if B
   P if B and C
   P if B and D
   P if B & Z
  P if C
  P if Z
Note that more BK can lead to slower learning!
But hopefully less search depth is needed

Controlling the Search for a Good Rule

Choose a ‘seed’ positive example, then only consider properties that are true of this example
Specify argument types and whether arguments are ‘input’ (+) or ‘output’ (-)
 – Only consider adding a literal if all of its input arguments are already present in the rule
 – For example, enemies(+person, -person): only if a variable of type PERSON is already in the rule [eg, murdered(person)] do we consider adding that person’s enemies
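
As an illustration of this input/output restriction, here is a hedged sketch of the check an ILP system might make before proposing a literal; the MODES table, can_add_literal, and their argument conventions are assumptions made up for this sketch, not the lecture’s notation.

# Hypothetical mode declarations: for each predicate, a list of (io, type) pairs,
# where '+' means the argument must reuse a variable of that type already in the rule.
MODES = {
    "enemies": [("+", "person"), ("-", "person")],
    "murdered": [("+", "person")],
}

def can_add_literal(pred, rule_vars_by_type):
    """Allow a literal only if every '+' (input) argument can be filled by a
    variable of the right type that already appears in the partial rule."""
    return all(io != "+" or rule_vars_by_type.get(arg_type)
               for io, arg_type in MODES[pred])

# Partial rule murdered(V) already contains one PERSON variable, V:
print(can_add_literal("enemies", {"person": ["V"]}))  # True
print(can_add_literal("enemies", {}))                 # False: no person variable yet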

Formal Specification of the ILP Task

Given: a set of positive examples (P)
       a set of negative examples (N)
       some background knowledge (BK)
Do:    induce additional knowledge (AK) such that
       BK ∧ AK allows all/most examples in P to be proved
       BK ∧ AK allows none/few examples in N to be proved
Technically, the BK also contains all the facts about the pos and neg examples, plus some rules
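
Restated in standard notation (using ⊢ for “allows to be proved”, and taking the idealized all/none case of the “all/most” and “none/few” above):

\text{Given } P,\ N,\ BK,\ \text{find } AK \text{ such that}\quad
\forall p \in P:\ BK \wedge AK \vdash p
\qquad\text{and}\qquad
\forall n \in N:\ BK \wedge AK \nvdash n .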

ILP Wrapup

Use best-first search with a large beam
Commonly used scoring function: #posExCovered - #negExCovered - ruleLength
Performs ML without requiring fixed-length feature vectors
Produces human-readable rules (straightforward to convert FOPC to English)
Can be slow due to the large search space
Appealing ‘inner loop’ for probabilistic-logic learning
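
The scoring function, written as a small Python helper for concreteness (covers is the same hypothetical test used in the covering sketch above, and rule_length is assumed to count the literals in the rule):

def rule_score(rule, pos, neg, covers, rule_length):
    """Score = #pos examples covered - #neg examples covered - rule length."""
    return (sum(covers(rule, ex) for ex in pos)
            - sum(covers(rule, ex) for ex in neg)
            - rule_length(rule))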

COLT: Probably Approximately Correct (PAC) Learning

PAC theory of learning (Valiant ’84)
Given:
 C       class of possible concepts
 c ∈ C   target concept
 H       hypothesis space (usually H = C)
 ε, δ    correctness bounds
 N       polynomial number of examples

Probably Approximately Correct (PAC) Learning

Do: with probability 1 - δ, return an h in H whose accuracy is at least 1 - ε
Do this for any probability distribution over the examples
In other words: Prob[ error(h, c) > ε ] < δ
(The slide shows overlapping regions for h and c; the shaded regions, where h and c disagree, are where errors occur)
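
For reference, the same criterion in LaTeX, with error(h, c) written out as the probability mass of the region where h and c disagree (the standard definition; D is whatever distribution generates the examples):

\mathrm{error}(h, c) = \Pr_{x \sim D}\left[\, h(x) \neq c(x) \,\right],
\qquad
\Pr\left[\, \mathrm{error}(h, c) > \varepsilon \,\right] < \delta
\quad \text{for any distribution } D .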

How Many Examples Needed to be PAC?

Consider finite hypothesis spaces
Let H_bad ≡ { h_1, …, h_z }, the set of hypotheses whose (‘test set’) error is > ε
Goal: with high probability, eliminate all items in H_bad via (noise-free) training examples

How Many Examples Needed to be PAC? (cont.)

How can an h be bad (true error > ε) even though it is correct on all the training examples? If we never see any examples in the shaded regions (where h and c disagree)
We’ll compute an N s.t. the odds of this are sufficiently low (recall, N = number of examples)

H_bad

Consider H_1 ∈ H_bad and ex ∈ {N}, the set of N examples
What is the probability that H_1 is consistent with ex?
 Prob[ consistent_A(ex, H_1) ] ≤ 1 - ε   (since H_1 is bad, its error rate is at least ε)

H_bad (cont.)

What is the probability that H_1 is consistent with all N examples?
 Prob[ consistent_B({N}, H_1) ] ≤ (1 - ε)^|N|   (by the iid assumption)

H_bad (cont.)

What is the probability that some member of H_bad is consistent with the examples in {N}?
 Prob[ consistent_C({N}, H_bad) ] ≡ Prob[ consistent_B({N}, H_1) ∨ … ∨ consistent_B({N}, H_z) ]
  ≤ |H_bad| × (1 - ε)^|N|   // P(A ∨ B) = P(A) + P(B) - P(A ∧ B); ignore the last term in this upper-bound calculation
  ≤ |H| × (1 - ε)^|N|       // since H_bad ⊆ H

Solving for #Examples, |N|

We want Prob[ consistent_C({N}, H_bad) ] ≤ |H| × (1 - ε)^|N| < δ
Recall that we want the probability of a bad concept surviving to be less than δ, our bound on learning a poor concept
Assume that if many consistent hypotheses survive, we get unlucky and choose a bad one (we’re doing a worst-case analysis)

Solving for |N| (the number of examples needed to be confident of getting a good model)

Solving gives |N| > [ ln(1/δ) + ln(|H|) ] / -ln(1 - ε)
Since ε ≤ -ln(1 - ε) over [0, 1), we get |N| > [ ln(1/δ) + ln(|H|) ] / ε
(Aside: notice that this calculation assumed we could always find a hypothesis that fits the training data)
Notice we made NO assumptions about the probability distribution of the data (other than that it does not change)
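
The algebra behind these two lines, reconstructed from the bound on the previous slide:

|H|\,(1-\varepsilon)^{|N|} < \delta
\;\Longleftrightarrow\;
|N|\,\ln(1-\varepsilon) < \ln\delta - \ln|H|
\;\Longleftrightarrow\;
|N| > \frac{\ln|H| + \ln(1/\delta)}{-\ln(1-\varepsilon)} ,

\text{and since } \varepsilon \le -\ln(1-\varepsilon) \text{ on } [0,1),
\text{ it suffices to take }\quad
|N| > \frac{\ln|H| + \ln(1/\delta)}{\varepsilon} .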

Example: Number of Instances Needed

Assume:
 F = 100 binary features
 H = all (pure) conjuncts [3^|F| possibilities (for each i: use f_i, use ¬f_i, or ignore f_i), so ln|H| = |F| × ln 3 ≈ |F|]
 ε = 0.01, δ = 0.01
N = [ ln(1/δ) + ln(|H|) ] / ε = 100 × [ln(100) + 100] ≈ 10^4
But how many real-world concepts are pure conjuncts with noise-free training data?
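
A quick check of this arithmetic (pac_sample_bound is just the formula from the previous slide wrapped in a function, not a library call):

import math

def pac_sample_bound(ln_H, epsilon, delta):
    """Consistent-learner PAC bound: |N| > [ln(1/delta) + ln|H|] / epsilon."""
    return (math.log(1.0 / delta) + ln_H) / epsilon

F = 100
ln_H = F * math.log(3)                     # H = all pure conjuncts over 100 binary features
print(pac_sample_bound(ln_H, 0.01, 0.01))  # about 11,447 examples -- on the order of 10^4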

Agnostic Learning

So far we’ve assumed we knew the concept class, but that is unrealistic on real-world data
In agnostic learning we relax this assumption
We instead aim to find a hypothesis arbitrarily close (ie, within ε error) to the best* hypothesis in our hypothesis space
We now need |N| ≥ [ ln(1/δ) + ln(|H|) ] / (2ε²)   (the denominator had been just ε before)
 * ie, closest to the true concept
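
For reference, this bound follows from Hoeffding’s inequality plus the same union bound over H; this is the standard derivation sketch and is not spelled out on the slide:

\Pr\left[\, \widehat{\mathrm{error}}(h) \le \mathrm{error}(h) - \varepsilon \,\right]
\le e^{-2|N|\varepsilon^{2}}
\quad\Longrightarrow\quad
|H|\, e^{-2|N|\varepsilon^{2}} < \delta
\;\Longleftrightarrow\;
|N| > \frac{\ln|H| + \ln(1/\delta)}{2\varepsilon^{2}} .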

Two Senses of Complexity

Sample complexity (the number of examples needed)
vs.
Time complexity (the time needed to find an h ∈ H that is consistent with the training examples)
 - in CS, we usually only address time complexity

Complexity (cont.)

 – Some concepts require a polynomial number of examples, but an exponential amount of time (in the worst case)
 – Eg, optimally training neural networks is NP-hard (recall that BP is a ‘greedy’ algorithm that finds a local minimum)

Some Other COLT Topics

COLT + clustering, + k-NN, + RL, + SVMs, + ANNs, + ILP, etc.
Average-case analysis (vs. worst case)
Learnability of natural languages (is language innate?)
Learnability in parallel

Summary of COLT Strengths

Formalizes the learning task
Allows for imperfections (eg, ε and δ in PAC)
Work on boosting is an excellent case of ML theory influencing ML practice
Shows which concepts are intrinsically hard to learn (eg, k-term DNF*)
 * though a superset of this class is PAC learnable!

Summary of COLT Weaknesses

Most analyses are worst-case
Hence, bounds are often much higher than what works in practice (see the Domingos article assigned early this semester)
Use of ‘prior knowledge’ is not captured very well yet