
1 Introduction to Machine Learning

2 Learning
- Agent has made observations (data)
- Now must make sense of them (hypotheses)
- Hypotheses alone may be important (e.g., in basic science)
- For inference (e.g., forecasting)
- To take sensible actions (decision making)
- A basic component of economics, the social and hard sciences, engineering, …

3 Last Time
- Going from observed data to an unknown hypothesis
- 3 types of statistical learning techniques:
  - Bayesian inference
  - Maximum likelihood (ML)
  - Maximum a posteriori (MAP)
- Applied to learning:
  - Candy bag example (5 discrete hypotheses)
  - Coin flip probability (infinitely many hypotheses from 0 to 1)

4 Bayesian View of Learning
P(h_i | d) = α P(d | h_i) P(h_i) is the posterior
(Recall, 1/α = P(d) = Σ_i P(d | h_i) P(h_i))
P(d | h_i) is the likelihood
P(h_i) is the hypothesis prior
Candy bag hypotheses:
h1: C: 100%, L: 0%
h2: C: 75%, L: 25%
h3: C: 50%, L: 50%
h4: C: 25%, L: 75%
h5: C: 0%, L: 100%
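A minimal sketch of this update in Python. The cherry/lime proportions come from the slide; the prior (0.1, 0.2, 0.4, 0.2, 0.1) is an assumed textbook-style prior, since the slide does not state one.

```python
# Bayesian update for the five candy-bag hypotheses.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5), assumed prior
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i), from the slide

def posterior(observations, prior, p_lime):
    """P(h_i | d) after a string of 'C' (cherry) / 'L' (lime) observations."""
    post = list(prior)
    for obs in observations:
        likelihood = [p if obs == 'L' else 1.0 - p for p in p_lime]
        post = [l * q for l, q in zip(likelihood, post)]
        z = sum(post)                     # z = P(d) = 1/alpha
        post = [q / z for q in post]
    return post

print(posterior('LL', priors, p_lime))   # posterior mass shifts toward h4, h5
```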

5 Bayesian vs. Maximum Likelihood vs. Maximum a Posteriori
- Bayesian reasoning requires thinking about all hypotheses: predictions use P(X | d) = Σ_i P(X | h_i) P(h_i | d)
- ML and MAP just try to get the "best" single hypothesis, predicting with P(X | h_ML) or P(X | h_MAP)
- ML ignores prior information; MAP uses it, which smoothes out the estimate for small datasets
- All are asymptotically equivalent given large enough datasets

6 Learning Bernoulli Distributions
Example data (counts of observations for each combination of A, B, C):

A B C # obs
1 1 1 3
1 1 0 5
1 0 1 1
1 0 0 10
0 1 1 4
0 1 0 7
0 0 1 6
0 0 0 7

ML estimates of P(C | A, B):
A, B: θ1 = 3/8
A, ¬B: θ2 = 1/11
¬A, B: θ3 = 4/11
¬A, ¬B: θ4 = 6/13
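The ML estimates are just count ratios; a short sketch reproducing them from the table above (the dictionary encoding of the rows is an implementation choice, not from the slides):

```python
# ML estimates of P(C=1 | A, B) from the counts table.
counts = {
    (1, 1, 1): 3, (1, 1, 0): 5, (1, 0, 1): 1, (1, 0, 0): 10,
    (0, 1, 1): 4, (0, 1, 0): 7, (0, 0, 1): 6, (0, 0, 0): 7,
}

def ml_estimate(a, b):
    """Fraction of C=1 among observations with the given A, B values."""
    n1, n0 = counts[(a, b, 1)], counts[(a, b, 0)]
    return n1 / (n1 + n0)

for a in (1, 0):
    for b in (1, 0):
        print(f"P(C=1 | A={a}, B={b}) = {ml_estimate(a, b):.3f}")
# Prints 3/8, 1/11, 4/11, 6/13, matching theta_1 .. theta_4 above.
```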

7 Maximum Likelihood for Bayes Nets
For any BN, the ML parameters of any CPT are given by the fraction of observed values in the data.
Network: Earthquake → Alarm ← Burglar
With N = 1000 observations: E observed 500 times, B observed 200 times, so P(E) = 0.5 and P(B) = 0.2.
Conditional counts for the Alarm:
A | E, B: 19/20 → P(A | E, B) = 0.95
A | ¬E, B: 188/200 → P(A | ¬E, B) = 0.94
A | E, ¬B: 170/500 → P(A | E, ¬B) = 0.34
A | ¬E, ¬B: 1/380 → P(A | ¬E, ¬B) ≈ 0.003

8 Maximum a Posteriori with Beta Priors
Example data: the same counts table as slide 6.
MAP estimates assuming a Beta prior with α = β = 3, i.e., virtual counts of 2 heads and 2 tails:
A, B: θ1 = (3+2)/(8+4)
A, ¬B: θ2 = (1+2)/(11+4)
¬A, B: θ3 = (4+2)/(11+4)
¬A, ¬B: θ4 = (6+2)/(13+4)
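A sketch of the MAP version, reusing the `counts` dictionary from the snippet above; the Beta(3, 3) prior contributes the 2 virtual heads and 2 virtual tails:

```python
# MAP estimate of P(C=1 | A, B) with Beta(3, 3) virtual counts.
def map_estimate(a, b, virtual_heads=2, virtual_tails=2):
    n1 = counts[(a, b, 1)] + virtual_heads
    n0 = counts[(a, b, 0)] + virtual_tails
    return n1 / (n1 + n0)

print(map_estimate(1, 1))   # (3+2)/(8+4) ≈ 0.417, vs the ML value 3/8
```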

9 Topics in Machine Learning
Applications: document retrieval, document classification, data mining, computer vision, scientific discovery, robotics, …
Tasks & settings: classification, ranking, clustering, regression, decision-making; supervised, unsupervised, semi-supervised, active, reinforcement learning
Techniques: Bayesian learning, decision trees, neural networks, support vector machines, boosting, case-based reasoning, dimensionality reduction, …

10 What Is Learning?
- Mostly generalization from experience: "Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future" (M.R. Genesereth and N.J. Nilsson, in Logical Foundations of AI, 1987)
- Concepts, heuristics, policies
- Supervised vs. unsupervised learning

11 Inductive Learning
Basic form: learn a function from examples.
- f is the unknown target function
- An example is a pair (x, f(x))
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples D
- This is an instance of supervised learning:
  - Classification task: f ∈ {0, 1, …, C} (usually C = 1)
  - Regression task: f takes real values

12–16 Inductive Learning Method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting. [Figures: a sequence of candidate curves fit to the same data points]

17 Inductive Learning Method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting: h = D is a trivial, but perhaps uninteresting, solution (caching).
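To make the curve-fitting picture concrete, a small sketch on made-up data points: low-degree hypotheses only approximate f on the training set, while a degree-4 polynomial through 5 points is consistent (it agrees with f on every example, the caching-like extreme):

```python
import numpy as np

# Curve fitting as hypothesis adjustment. The data is illustrative.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])      # noisy samples of f

for degree in (1, 2, 4):
    h = np.polyfit(x, y, degree)             # least-squares fit of this degree
    max_err = np.abs(np.polyval(h, x) - y).max()
    print(degree, max_err)                   # degree 4: error ~ 0, consistent h
```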

18 Classification Task
- The target function f(x) takes on values True and False
- An example is positive if f is True, else it is negative
- The set X of all examples is the example set
- The training set is a subset of X (a small one!)

19 Logic-Based Inductive Learning
Here, examples (x, f(x)) take on discrete values.

20 Logic-Based Inductive Learning (continued)
Here, examples (x, f(x)) take on discrete values. [Figure: a concept over the example set]
Note that the training set does not say whether an observable predicate is pertinent or not.

21 Rewarded Card Example
- Deck of cards, with each card designated by [r, s], its rank and suit, and some cards "rewarded"
- Background knowledge KB:
  ((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
  ((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
  ((s=S) ∨ (s=C)) ⇒ BLACK(s)
  ((s=D) ∨ (s=H)) ⇒ RED(s)
- Training set D:
  REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S])

22 Rewarded Card Example (continued)
Same deck, background knowledge, and training set as above. A possible inductive hypothesis:
h ≡ (NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s]))
There are several possible inductive hypotheses.

23 Learning a Logical Predicate (Concept Classifier)
- Set E of objects (e.g., cards)
- Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
- Observable predicates A(x), B(x), … (e.g., NUM, RED)
- Training set: values of CONCEPT for some combinations of values of the observable predicates

24 Learning a Logical Predicate (continued)
Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable predicates, e.g.:
CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

25 Hypothesis Space
- A hypothesis is any sentence of the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built using the observable predicates
- The set of all hypotheses is called the hypothesis space H
- A hypothesis h agrees with an example if it gives the correct value of CONCEPT

26 Inductive Learning Scheme
[Figure: from the example set X = {[A, B, …, CONCEPT]}, containing positive and negative examples, a training set D is drawn; an inductive hypothesis h is selected from the hypothesis space H = {[CONCEPT(x) ⇔ S(A, B, …)]}]

27 Size of the Hypothesis Space
- n observable predicates
- 2^n entries in the truth table defining CONCEPT, and each entry can be filled with True or False
- In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from
- n = 6 ⇒ about 2 × 10^19 hypotheses!
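The count is easy to check numerically (Python integers are exact, so there is no overflow concern):

```python
# Number of distinct Boolean concepts over n observable predicates: 2^(2^n).
for n in range(1, 7):
    print(n, 2 ** (2 ** n))
# n = 6 gives 2**64 = 18446744073709551616, about 2 x 10^19 as on the slide.
```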

28 Multiple Inductive Hypotheses
All of the following agree with all the examples in the training set:
h1 ≡ NUM(r) ∧ BLACK(s) ⇔ REWARD([r,s])
h2 ≡ BLACK(s) ∧ ¬(r=J) ⇔ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇔ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇔ REWARD([r,s])

29 Multiple Inductive Hypotheses (continued)
Need for a system of preferences, called an inductive bias, to compare possible hypotheses.

30 Notion of Capacity
- It refers to the ability of a machine to learn any training set without error
- A machine with too much capacity is like a botanist with photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything he has seen before
- A machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree
- Good generalization can only be achieved when the right balance is struck between the accuracy attained on the training set and the capacity of the machine

31 Keep-It-Simple (KIS) Bias
- Examples:
  - Use many fewer observable predicates than training examples
  - Constrain the learnt predicate, e.g., to use only "high-level" observable predicates such as NUM, FACE, BLACK, and RED, and/or to have simple syntax
- Motivation:
  - If a hypothesis is too complex, it is not worth learning it (data caching does the job as well)
  - There are many fewer simple hypotheses than complex ones, hence the hypothesis space is smaller

32 Keep-It-Simple (KIS) Bias (continued)
Einstein: "A theory must be as simple as possible, but not simpler than this."

33 Keep-It-Simple (KIS) Bias (continued)
If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k).

34 Capacity Is Not the Only Criterion
Accuracy on the training set isn't the best measure of performance. [Figure: learn a hypothesis from H on the training set D, then test it on the rest of the example set X]

35 Generalization Error
A hypothesis h is said to generalize well if it achieves low error on all examples in X. [Figure: learn/test diagram as on the previous slide]

36 Assessing Performance of a Learning Algorithm
- Samples from X are typically unavailable
- Take out some of the training set
- Train on the remaining training set
- Test on the excluded instances
- Cross-validation

37 Cross-Validation
Split the original set of examples; train on one part. [Figure: the examples D divided into a training portion and a held-out portion]

38 Cross-Validation (continued)
Evaluate the hypothesis on the testing set. [Figure]

39 Cross-Validation (continued)
Evaluate the hypothesis on the testing set. [Figure: the hypothesis's predictions on the testing set]

40 Cross-Validation (continued)
Compare the true concept against the prediction: 9/13 correct.
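A minimal holdout-evaluation sketch of this procedure. The `examples` list of (x, label) pairs, the `learn` callable, and the split fraction are assumed interfaces for illustration, not anything specified on the slides:

```python
import random

# Holdout evaluation as on slides 37-40.
def holdout_accuracy(examples, learn, test_fraction=0.25, seed=0):
    """Split examples, train on one part, and score accuracy on the rest."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    h = learn(train)                                # train on the remainder
    correct = sum(1 for x, y in test if h(x) == y)  # truth vs. prediction
    return correct / len(test)                      # e.g., 9/13 on slide 40
```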

41 Tennis Example
Evaluate a learning algorithm for PlayTennis = S(Temperature, Wind). [Data table not captured in the transcript]

42 Tennis Example (continued)
Trained hypothesis: PlayTennis = (T = Mild or Cool) ∧ (W = Weak)
Training errors = 3/10; testing errors = 4/4

43 Tennis Example (continued)
Trained hypothesis: PlayTennis = (T = Mild or Cool)
Training errors = 3/10; testing errors = 1/4

44 Tennis Example (continued)
Trained hypothesis: PlayTennis = (T = Mild or Cool)
Training errors = 3/10; testing errors = 2/4

45 Ten Commandments of Machine Learning
Thou shalt not:
- Train on examples in the testing set
- Form assumptions by "peeking" at the testing set, then formulating inductive bias

46 Supervised Learning Flow Chart
[Flow chart: the target function (the unknown concept we want to approximate) generates datapoints (observations we have seen) that form the training set; the learner, given a hypothesis space (the choice of learning algorithm), outputs an inductive hypothesis; its predictions on a test set (observations we will see in the future) give better quantities for assessing performance]

47 How to Construct a Better Learner?
Ideas?

48 Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
A?
├─ False: False
└─ True: B?
   ├─ False: True
   └─ True: C?
      ├─ False: False
      └─ True: True
Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
- x is a mushroom
- CONCEPT = POISONOUS
- A = YELLOW
- B = BIG
- C = SPOTTED

49 Predicate as a Decision Tree (continued)
Same tree; two more observable predicates for the mushroom example:
- D = FUNNEL-CAP
- E = BULKY
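A direct encoding of the tree above as a function, using the mushroom reading of the predicates; the function name and signature are illustrative:

```python
# CONCEPT(x) <=> A(x) and (not B(x) or C(x)), evaluated as the decision tree.
# Mushroom reading: a = YELLOW, b = BIG, c = SPOTTED.
def concept(a: bool, b: bool, c: bool) -> bool:
    if not a:
        return False        # A? False branch: not yellow, not poisonous
    if not b:
        return True         # yellow and small: poisonous
    return c                # yellow and big: poisonous iff spotted

assert concept(True, False, False)        # yellow, small
assert concept(True, True, True)          # yellow, big, spotted
assert not concept(False, True, True)     # not yellow
```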

50 Training Set
Ex. # A B C D E CONCEPT
1 False False True False True False
2 False True False False False False
3 False True True True True False
4 False False True False False False
5 False False False True True False
6 True False True False False True
7 True False False True False True
8 True False True False True True
9 True True True False True True
10 True True True True True True
11 True True False False False False
12 True True False True True False
13 True False True True True True

51 Possible Decision Tree
[Figure: a decision tree that tests D at the root, then C and E, then B, E, and A at deeper levels, and classifies all 13 training examples correctly]

52 Possible Decision Tree (continued)
The large tree computes:
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ ¬A)))))
Compare with the much smaller tree of slide 48 for:
CONCEPT ⇔ A ∧ (¬B ∨ C)

53 Possible Decision Tree (continued)
KIS bias ⇒ build the smallest decision tree.
This is a computationally intractable problem ⇒ use a greedy algorithm.

54 Getting Started: Top-Down Induction of a Decision Tree
The training set (slide 50) has the distribution:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

55 Getting Started (continued)
Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.
Assuming that we will include only one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the number of misclassified examples in the training set)? ⇒ Greedy algorithm

56 Assume It's A
A = True: True {6, 7, 8, 9, 10, 13}, False {11, 12}
A = False: False {1, 2, 3, 4, 5}
If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.
⇒ The number of misclassified examples from the training set is 2

57 Assume It's B
B = True: True {9, 10}, False {2, 3, 11, 12}
B = False: True {6, 7, 8, 13}, False {1, 4, 5}
If we test only B, we will report that CONCEPT is False if B is True and True otherwise.
⇒ The number of misclassified examples from the training set is 5

58 Assume It's C
C = True: True {6, 8, 9, 10, 13}, False {1, 3, 4}
C = False: True {7}, False {2, 5, 11, 12}
If we test only C, we will report that CONCEPT is True if C is True and False otherwise.
⇒ The number of misclassified examples from the training set is 4

59 Assume It's D
D = True: True {7, 10, 13}, False {3, 5}
D = False: True {6, 8, 9}, False {1, 2, 4, 11, 12}
If we test only D, we will report that CONCEPT is True if D is True and False otherwise.
⇒ The number of misclassified examples from the training set is 5

60 Assume It's E
E = True: True {8, 9, 10, 13}, False {1, 3, 5, 12}
E = False: True {6, 7}, False {2, 4, 11}
If we test only E, we will report that CONCEPT is False, independent of the outcome.
⇒ The number of misclassified examples from the training set is 6

61 Assume It's E (continued)
So the best predicate to test is A.
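The comparison on slides 56–61 can be reproduced in a few lines; the tuple encoding of the slide-50 table is an implementation choice:

```python
# Greedy first-split choice: for each predicate, count training examples
# misclassified by majority rule in each branch.
# Rows are (A, B, C, D, E, CONCEPT) from the slide-50 table.
TRAINING_SET = [
    (False, False, True,  False, True,  False),
    (False, True,  False, False, False, False),
    (False, True,  True,  True,  True,  False),
    (False, False, True,  False, False, False),
    (False, False, False, True,  True,  False),
    (True,  False, True,  False, False, True),
    (True,  False, False, True,  False, True),
    (True,  False, True,  False, True,  True),
    (True,  True,  True,  False, True,  True),
    (True,  True,  True,  True,  True,  True),
    (True,  True,  False, False, False, False),
    (True,  True,  False, True,  True,  False),
    (True,  False, True,  True,  True,  True),
]

def split_errors(examples, i):
    """Misclassifications if we test only predicate i and apply majority rule."""
    errors = 0
    for branch in (True, False):
        labels = [ex[-1] for ex in examples if ex[i] is branch]
        if labels:
            errors += min(labels.count(True), labels.count(False))
    return errors

for i, name in enumerate("ABCDE"):
    print(name, split_errors(TRAINING_SET, i))   # A:2, B:5, C:4, D:5, E:6
```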

62 Choice of Second Predicate
Having chosen A, test C in the A = True branch (the A = False branch is already labeled False):
C = True: True {6, 8, 9, 10, 13}
C = False: True {7}, False {11, 12}
⇒ The number of misclassified examples from the training set is 1

63 Choice of Third Predicate
In the A = True, C = False branch, test B:
B = True: False {11, 12}
B = False: True {7}

64 Final Tree
A?
├─ False: False
└─ True: C?
   ├─ True: True
   └─ False: B?
      ├─ True: False
      └─ False: True
CONCEPT ⇔ A ∧ (C ∨ ¬B), i.e., the original CONCEPT ⇔ A ∧ (¬B ∨ C)

65 Top-Down Induction of a DT
DTL(Δ, Predicates)
1. If all examples in Δ are positive, then return True
2. If all examples in Δ are negative, then return False
3. If Predicates is empty, then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(Δ+A, Predicates − A),
   - right branch is DTL(Δ−A, Predicates − A)
(Δ+A is the subset of examples that satisfy A; Δ−A the subset that do not.)

66 Top-Down Induction of a DT (continued)
Noise in the training set! Step 3 may return the majority rule, instead of failure.
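A runnable sketch of DTL, reusing `TRAINING_SET` and `split_errors` from the earlier snippet and substituting majority rule for failure as slide 66 suggests:

```python
# Greedy top-down decision-tree learner from the pseudocode above.
def dtl(examples, predicates):
    """Return True/False leaves or (predicate_index, true_branch, false_branch)."""
    if not examples:
        return False                      # empty branch: arbitrary default
    labels = [ex[-1] for ex in examples]
    if all(labels):
        return True                       # step 1: all positive
    if not any(labels):
        return False                      # step 2: all negative
    if not predicates:
        # step 3, softened per slide 66: majority rule instead of failure
        return labels.count(True) >= labels.count(False)
    best = min(predicates, key=lambda i: split_errors(examples, i))   # step 4
    rest = [p for p in predicates if p != best]
    pos = [ex for ex in examples if ex[best]]                         # Δ+A
    neg = [ex for ex in examples if not ex[best]]                     # Δ−A
    return (best, dtl(pos, rest), dtl(neg, rest))                     # step 5

tree = dtl(TRAINING_SET, list(range(5)))
print(tree)   # roots at predicate 0 (A), as derived on slides 56-64
```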

67 Comments
- Widely used algorithm
- Greedy
- Robust to noise (incorrect examples)
- Not incremental

68 Miscellaneous Issues
Assessing performance:
- Training set and test set
- Learning curve: % correct on the test set as a function of the size of the training set [Figure: typical learning curve]

69 Miscellaneous Issues (continued)
Some concepts are unrealizable within a machine's capacity. [Learning-curve figure]

70 Miscellaneous Issues (continued)
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

71 Miscellaneous Issues (continued)
Tree pruning: terminate the recursion when the number of errors / information gain is small.

72 Miscellaneous Issues (continued)
The resulting decision tree plus majority rule may not classify correctly all examples in the training set.

73 Miscellaneous Issues (continued)
Further issues: incorrect examples, missing data, multi-valued and continuous attributes.

74 Continuous Attributes
Continuous attributes can be converted into logical ones via thresholds: X becomes X < a.
When considering splitting on X, pick the threshold a that minimizes the number of errors.
[Figure: a sorted list of attribute values with candidate thresholds between them]
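A sketch of this threshold search on made-up data. Candidate thresholds are taken as midpoints between distinct consecutive values, a common convention that the slide does not spell out:

```python
# Pick the threshold a for the split X < a that minimizes majority-rule errors.
def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    candidates = [(xs[i] + xs[i + 1]) / 2
                  for i in range(len(xs) - 1) if xs[i] != xs[i + 1]]

    def errors(a):
        left = [y for x, y in pairs if x < a]
        right = [y for x, y in pairs if x >= a]
        return (min(left.count(True), left.count(False)) +
                min(right.count(True), right.count(False)))

    return min(candidates, key=errors)

values = [3, 4, 4, 5, 5, 6, 6, 7]                        # hypothetical measurements
labels = [True, True, True, True, True, False, False, False]
print(best_threshold(values, labels))                    # 5.5 separates the classes
```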

75 Learnable Concepts
- Some simple concepts cannot be represented compactly in DTs:
  - Parity(x) = X1 xor X2 xor … xor Xn
  - Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
- These require trees of size exponential in the number of attributes
- Need an exponential number of examples to learn them exactly
- The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT

76 Applications of Decision Trees
- Medical diagnosis / drug design
- Evaluation of geological systems for assessing gas and oil basins
- Early detection of problems (e.g., jamming) during oil-drilling operations
- Automatic generation of rules in expert systems

77 Human-Readability
Decision trees also have the advantage of being easily understood by humans. This is a legal requirement in many areas:
- Loans & mortgages
- Health insurance
- Welfare

