Machine Learning Chapter 2. Concept Learning and The General-to-specific Ordering Gun Ho Lee Soongsil University, Seoul
2 Outline
– Learning from examples
– General-to-specific ordering over hypotheses
– Version spaces and candidate elimination algorithm
– Picking new examples
– The need for inductive bias
Note: a simple approach that assumes no noise and illustrates the key concepts.
3 A Concept
Examples of concepts
– “birds”, “car”, “situations in which I should study more in order to pass the exam”
Concept
– Some subset of objects or events defined over a larger set, or
– A boolean-valued function defined over this larger set.
– The concept “birds” is the subset of animals that constitute birds.
4 A Concept (Adapted by Doug Downey from: Bryan Pardo, EECS 349 Fall 2007)
Let there be a set of objects, X. X = { 백구, 야옹이, 도그, 강세이 }
A concept C is…
– A subset of X: C = dogs = { 백구, 도그, 강세이 }
– A function that returns 1 only for elements in the concept: C( 백구 ) = 1, C( 야옹이 ) = 0
5 Instance Representation
Represent an object (or instance) as an n-tuple of attributes. Example: days (6-tuples), each labeled with EnjoySport.
Instance  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1         Sunny  Warm     Normal    Strong  Warm   Same      Yes
2         Sunny  Warm     High      Strong  Warm   Same      Yes
3         Rainy  Cold     High      Strong  Warm   Change    No
4         Sunny  Warm     High      Strong  Cool   Change    Yes
6 Concept Learning Learning –Inducing general functions from specific training examples Concept learning –Acquiring the definition of a general category given a sample of positive and negative training examples of the category –Inferring a boolean-valued function from training examples of its input and output.
7 A Concept Learning Task
Target concept EnjoySport
– “days on which 박지성 enjoys water sport”
Hypothesis
– A vector of 6 constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water, Forecast.
– Example hypothesis: 박지성 enjoys water sports when the weather is cold with high humidity. Over ⟨Sky, AirTemp, Humidity, Wind, Water, Forecast⟩ this is ⟨?, Cold, High, ?, ?, ?⟩.
8 Representing Hypotheses
Many possible representations. Here, h is a conjunction of constraints on attributes. Each constraint can be
– a specific value (e.g., Water = Warm)
– don’t care (e.g., “Water = ?”)
– no value allowed (e.g., “Water = Φ”)
For example, a hypothesis gives one such constraint for each of Sky, AirTemp, Humidity, Wind, Water, Forecast.
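A minimal sketch of this representation, assuming a hypothesis is stored as a tuple with one constraint per attribute. The encoding is an illustrative choice, not something the slides prescribe: the string "?" stands for don't care and Python's None stands for Φ. The function name `matches` is hypothetical.

```python
# Assumed encoding: "?" = don't care, None = Φ (no value allowed).
def matches(hypothesis, instance):
    """Return True if every constraint of the hypothesis accepts the instance."""
    return all(c is not None and (c == "?" or c == v)
               for c, v in zip(hypothesis, instance))

# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast
h = ("?", "Cold", "High", "?", "?", "?")
rainy_day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
sunny_day = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")

print(matches(h, rainy_day))  # h classifies this day as positive
print(matches(h, sunny_day))  # h classifies this day as negative
```

A single Φ constraint makes the hypothesis reject every instance, which is why `matches` returns False as soon as it meets None.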
9 Task: Learn a hypothesis from a dataset
10 Example Concept Function
“Days on which my friend 박지성 enjoys his favorite water sport”
INPUT                                            OUTPUT
Sky    Temp  Humid   Wind    Water  Forecast     C(x)
sunny  warm  normal  strong  warm   same         1
sunny  warm  high    strong  warm   same         1
rainy  cold  high    strong  warm   change       0
sunny  warm  high    strong  cool   change       1
11 The Learning Task
Given:
– Hypothesis space H: conjunctions of constraints on attributes (conjunctions of literals).
– Target concept c, e.g., EnjoySport: X → {0,1}
– Instances X: the set of items over which the concept is defined, e.g., days described by the attributes Sky, Temp, Humidity, Wind, Water, Forecast
– Training set D: positive and negative examples of the target function, ⟨x_1, c(x_1)⟩, …, ⟨x_n, c(x_n)⟩
Determine:
– A hypothesis h in H such that h(x) = c(x) for all x in X
12 Assumption 1
We will explore the space of all conjunctions. We assume the target concept falls within this space: target concept c ∈ H, where H is the hypothesis space.
13 Assumption 2 (inductive learning hypothesis)
A hypothesis close to the target concept c, obtained after seeing many training examples, will result in high accuracy on the set of unobserved examples: if hypothesis h is good on the training set D, then h is also good on the complement set D′.
14 Inductive Learning Hypothesis Learning task is to determine h identical to c over the entire set of instances X. But the only information about c is its value over D (training set). Inductive learning algorithms can at best guarantee that the induced h fits c over D. Inductive learning hypothesis –Any good hypothesis over a sufficiently large set of training examples will also approximate the target function well over unseen examples.
15 Concept Learning as Search
Search
– Find a hypothesis that best fits the training examples
– Efficient search in hypothesis space (finite/infinite)
Search space in EnjoySport
– Sky has 3 values (Sunny, Cloudy, and Rainy)
– Temp has 2 values (Warm and Cold)
– Humidity has 2 values (Normal and High)
– Wind has 2 values (Strong and Weak)
– Water has 2 values (Warm and Cool)
– Forecast has 2 values (Same and Change)
– 3 × 2 × 2 × 2 × 2 × 2 = 96 distinct instances
– 5 × 4 × 4 × 4 × 4 × 4 = 5120 syntactically distinct hypotheses within H (each attribute also allows Φ and ?)
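The counts above can be checked with a few lines of arithmetic. This is a small sketch; the dictionary name is an illustrative assumption. The last figure, the number of semantically distinct hypotheses, is not on the slide: it follows from the fact that every hypothesis containing Φ accepts no instance, so all of them collapse into a single empty concept.

```python
from math import prod

# Number of values per attribute, as listed on the slide.
values_per_attribute = {"Sky": 3, "Temp": 2, "Humidity": 2,
                        "Wind": 2, "Water": 2, "Forecast": 2}
counts = list(values_per_attribute.values())

instances = prod(counts)                      # one value per attribute
syntactic = prod(n + 2 for n in counts)       # each attribute also allows "?" and Φ
semantic = 1 + prod(n + 1 for n in counts)    # one empty concept + Φ-free hypotheses

print(instances, syntactic, semantic)  # 96 5120 973
```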
16 Concept Generality
A concept P is more general than or equal to another concept Q iff the set of instances represented by P includes the set of instances represented by Q.
[Figure: concept hierarchy with the concepts Mammal, Canine, Wolf, Pig, Dog and the instances White_fang, Lassie, Scooby_doo, Wilbur, Charlotte]
17 General to Specific Order
Consider two hypotheses h_1 and h_2.
Definition: h_j is more general than or equal to h_k (written h_j ≥ h_k) iff every instance that satisfies h_k also satisfies h_j:
(∀x ∈ X)[(h_k(x) = 1) → (h_j(x) = 1)]
This imposes a partial order on a hypothesis space.
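For conjunctive hypotheses the ordering can be tested constraint by constraint, without enumerating instances. The sketch below assumes hypotheses are tuples with "?" for don't care and None for Φ, and that every attribute has at least one value. The function name and the three sample hypotheses are illustrative, not taken from the slide.

```python
def more_general_or_equal(hj, hk):
    """True if every instance accepted by hk is also accepted by hj."""
    if any(c is None for c in hk):      # hk accepts no instance at all
        return True
    # Each constraint of hj must be "?" or equal to the constraint of hk.
    return all(a == "?" or a == b for a, b in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
h3 = ("Sunny", "?", "?", "?", "Cool", "?")

print(more_general_or_equal(h2, h1))  # True: h2 drops the Wind constraint
print(more_general_or_equal(h1, h2))  # False
# h1 and h3 are incomparable, which is why the order is only partial.
print(more_general_or_equal(h1, h3), more_general_or_equal(h3, h1))
```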
18 Instance, Hypotheses, and More-General-Than
The most general hypothesis: ⟨?, ?, ?, ?, ?, ?⟩
The most specific hypothesis: ⟨Φ, Φ, Φ, Φ, Φ, Φ⟩
[Figure: instances x1, x2 on one side and hypotheses h1, h2, h3 on the other, ordered from specific to general]
19 Find-S Algorithm: Finding a Maximally Specific Hypothesis
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
 – For each attribute constraint a_i in h: if the constraint a_i is satisfied by x, do nothing; else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
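The three steps can be sketched as follows for conjunctive hypotheses. The encoding is an assumption for illustration: None stands for Φ, "?" for don't care, and each training example is a pair (instance, label). The data are the four EnjoySport days from the instance table.

```python
def find_s(examples, n_attributes):
    """Return the maximally specific conjunctive hypothesis covering all positives."""
    h = [None] * n_attributes                 # step 1: most specific hypothesis
    for x, label in examples:
        if not label:                         # negative examples are ignored
            continue
        for i, value in enumerate(x):         # step 2: generalize where needed
            if h[i] is None:
                h[i] = value                  # Φ -> the observed value
            elif h[i] != value:
                h[i] = "?"                    # conflicting values -> don't care
    return tuple(h)                           # step 3

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

print(find_s(D, 6))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```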
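20 Hypothesis Space Search by Find-S
h0 = ⟨Φ, Φ, Φ, Φ, Φ, Φ⟩
x1 = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩, +    h1 = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
x2 = ⟨Sunny, Warm, High, Strong, Warm, Same⟩, +      h2 = ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
x3 = ⟨Rainy, Cold, High, Strong, Warm, Change⟩, −    h3 = h2 (a negative example changes nothing)
x4 = ⟨Sunny, Warm, High, Strong, Cool, Change⟩, +    h4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩
[Figure: instances and hypotheses h0 to h4 ordered from specific to general]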
21 Find-S Properties of Find-S Ignores every negative example (no revision to h required in response to negative examples). Guaranteed to output the most specific hypothesis consistent with the positive training examples (for conjunctive hypothesis space). Final h also consistent with negative examples provided the target c is in H and no error in D.
22 Weaknesses of Find-S Has the learner converged to the correct target concept ? No way to know whether the solution is unique. Why prefer the most specific hypothesis? How about the most general hypothesis? Are the training examples consistent ? Training sets containing errors or noise can severely mislead the algorithm Find-S. What if there are several maximally specific consistent hypotheses? No backtrack to explore a different branch of partial ordering.
23 Partial order of hypotheses I
24 Partial order of hypotheses II
25 Partial order of hypotheses III
26 The space of hypotheses
27 The space of hypotheses I
28 The space of hypotheses II
29 The space of hypotheses III
30 A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)
31 Version Space
The version space, VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D.
VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}
33 The List-Then-Eliminate Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example ⟨x, c(x)⟩, remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
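For a hypothesis space as small as EnjoySport, the algorithm can be run literally. The sketch below enumerates all 5120 syntactically distinct hypotheses and filters them against the four training days. The names `DOMAINS`, `matches`, and `list_then_eliminate` are illustrative, and None is an assumed stand-in for Φ.

```python
from itertools import product

DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def matches(h, x):
    """h(x): True if every constraint accepts the instance ("?" = any, None = Φ)."""
    return all(c is not None and (c == "?" or c == v) for c, v in zip(h, x))

def list_then_eliminate(domains, examples):
    # Step 1: every hypothesis in H (each attribute: a value, "?", or Φ).
    version_space = list(product(*[d + ("?", None) for d in domains]))
    # Step 2: drop every hypothesis that misclassifies some example.
    for x, label in examples:
        version_space = [h for h in version_space if matches(h, x) == label]
    return version_space  # Step 3

vs = list_then_eliminate(DOMAINS, D)
for h in vs:
    print(h)
```

The enumeration is what makes the approach impractical for larger spaces: the list built in step 1 grows multiplicatively with every attribute.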
34 Drawbacks of List-Then-Eliminate
– The algorithm requires exhaustively enumerating all hypotheses in H, an unrealistic approach (full search).
– If insufficient training data is available, the algorithm outputs a huge set of hypotheses consistent with the observed data.
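35 Example Version Space
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
Between S and G: ⟨Sunny, ?, ?, Strong, ?, ?⟩, ⟨Sunny, Warm, ?, ?, ?, ?⟩, ⟨?, Warm, ?, Strong, ?, ?⟩
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
Training examples: x1 = +, x2 = +, x3 = −, x4 = +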
36 Representing Version Spaces
The general boundary, G, of version space VS_{H,D} is the set of its maximally general members.
The specific boundary, S, of version space VS_{H,D} is the set of its maximally specific members.
Every member of the version space lies between these boundaries:
VS_{H,D} = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g ≥ h ≥ s)}
where x ≥ y means x is more general than or equal to y.
37 Relevant bounds
38 Basic Idea of Candidate Elimination Algorithm
1. Initialize G to the set of maximally general hypotheses in H
2. Initialize S to the set of maximally specific hypotheses in H
3. For each training example x, do
 – If x is positive: generalize S if necessary
 – If x is negative: specialize G if necessary
39 Candidate Elimination Algorithm (1/2)
G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d, do
If d is a positive example
– Remove from G any hypothesis inconsistent with d.
– For each hypothesis s ∈ S that is inconsistent with d:
  Remove s from S.
  Add to S all minimal generalizations h of s such that
  1. h is consistent with d, and
  2. some member of G is more general than h.
  Remove from S any hypothesis that is more general than another hypothesis in S (this maintains the specific boundary).
[Figure: hypotheses inconsistent with d are removed from G; minimal generalizations h are added to S]
40 Candidate Elimination Algorithm (2/2)
If d is a negative example
– Remove from S any hypothesis inconsistent with d.
– For each hypothesis g ∈ G that is inconsistent with d:
  Remove g from G.
  Add to G all minimal specializations h of g such that
  1. h is consistent with d, and
  2. some member of S is more specific than h.
  Remove from G any hypothesis that is less general than another hypothesis in G (this maintains the general boundary).
[Figure: hypotheses inconsistent with d are removed from S; minimal specializations h are added to G]
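Both halves of the algorithm can be sketched together for the conjunctive EnjoySport space. This is an illustrative implementation under stated assumptions, not the slides' own code: hypotheses are tuples, "?" is don't care, None stands for Φ, and all helper names are hypothetical. In this space a hypothesis has exactly one minimal generalization that covers a given instance, which keeps the positive branch short.

```python
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def matches(h, x):
    return all(c is not None and (c == "?" or c == v) for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    if any(c is None for c in hk):
        return True
    return all(a == "?" or a == b for a, b in zip(hj, hk))

def min_generalization(s, x):
    """The single minimal generalization of conjunctive s that covers x."""
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x, domains):
    """All minimal specializations of g that exclude the instance x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {(None,) * n}, {("?",) * n}
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}
            S = {s if matches(s, x) else min_generalization(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
            # Keep only maximally specific members.
            S = {s for s in S
                 if not any(t != s and more_general_or_equal(s, t) for t in S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):          # g already excludes x
                    new_G.add(g)
                    continue
                new_G.update(h for h in min_specializations(g, x, domains)
                             if any(more_general_or_equal(h, s) for s in S))
            # Keep only maximally general members.
            G = {g for g in new_G
                 if not any(t != g and more_general_or_equal(t, g) for t in new_G)}
    return S, G

S, G = candidate_elimination(D, DOMAINS)
print("S:", S)
print("G:", G)
```

Running it on the four training days yields one hypothesis in S and two in G, the boundaries of the six-member version space.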
41 Candidate-Elimination Algorithm
– When does this halt?
– If S and G are both singleton sets and identical, output that hypothesis and halt.
– If S or G becomes empty, no hypothesis in H is consistent with the training cases (the cases are inconsistent, or the target concept is not in H). Output this and halt.
– Else continue accepting new training examples.
42 Example Candidate Elimination
Instance space: integer points in the x,y plane with 0 ≤ x, y ≤ 10.
Hypothesis space: rectangles, that is, hypotheses of the form a ≤ x ≤ b, c ≤ y ≤ d, assuming a ≤ b and c ≤ d.
43 Example Candidate Elimination
examples = {}
G = {(a, b, c, d)} = {(0, 10, 0, 10)}
S = {}
44 Example Candidate Elimination
examples = {((3,4), +)}
G = {(0, 10, 0, 10)}
S = {(3, 3, 4, 4)}
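For rectangle hypotheses (a, b, c, d), the S update on a positive point is the smallest rectangle that still contains every positive point seen so far. A minimal sketch of that one step, assuming None represents the empty S; the function name and the second point (5, 2) are hypothetical additions for illustration.

```python
def generalize(s, point):
    """Smallest rectangle (a, b, c, d) containing rectangle s and the new point."""
    x, y = point
    if s is None:                       # empty S: the point itself is the rectangle
        return (x, x, y, y)
    a, b, c, d = s
    return (min(a, x), max(b, x), min(c, y), max(d, y))

s = generalize(None, (3, 4))
print(s)                       # (3, 3, 4, 4), as on the slide
print(generalize(s, (5, 2)))   # (3, 5, 2, 4) for a hypothetical second positive
```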
45 Example Trace
First initialize the S and G sets:
S0: { ⟨Φ, Φ, Φ, Φ, Φ, Φ⟩ }
G0: { ⟨?, ?, ?, ?, ?, ?⟩ }
46 Given 1st Example
The first example is positive: ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩, Yes
S0: { ⟨Φ, Φ, Φ, Φ, Φ, Φ⟩ }
S1: { ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩ }
G0, G1: { ⟨?, ?, ?, ?, ?, ?⟩ }
The positive-example step applies: nothing is removed from G, and the inconsistent s in S0 is replaced by its minimal generalization h. Check: does some g ∈ G satisfy g ≥ h ≥ s?
47 Given 2nd Example
The second example is positive: ⟨Sunny, Warm, High, Strong, Warm, Same⟩, Yes
S1: { ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩ }
S2: { ⟨Sunny, Warm, ?, Strong, Warm, Same⟩ }
G1, G2: { ⟨?, ?, ?, ?, ?, ?⟩ }
The positive-example step applies again: G is unchanged, and the s in S1 is replaced by its minimal generalization h. Check: does some g ∈ G satisfy g ≥ h ≥ s?
48 Given 3rd Example
The third example is negative: ⟨Rainy, Cold, High, Strong, Warm, Change⟩, No
S2, S3: { ⟨Sunny, Warm, ?, Strong, Warm, Same⟩ }
G2: { ⟨?, ?, ?, ?, ?, ?⟩ }
G3: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩ }
The negative-example step applies: S is unchanged, and the g in G2 is replaced by its minimal specializations h. Check: does some s ∈ S satisfy g ≥ h ≥ s?
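49 Given 3rd Example
The third example is negative: ⟨Rainy, Cold, High, Strong, Warm, Change⟩, No
S2, S3: { ⟨Sunny, Warm, ?, Strong, Warm, Same⟩ }
G2: { ⟨?, ?, ?, ?, ?, ?⟩ }
The candidate hypotheses h are the minimal specializations of the member of G2 that exclude d. Which of them are admissible, and which are not?
G3: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩ }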
50 Given 3rd Example
If d is a negative example
– Remove from S any hypothesis inconsistent with d.
Here the member of S2 does not cover d, so no hypothesis needs to be removed from S.
51 Given 3rd Example
If d is a negative example
– For each hypothesis g ∈ G that is inconsistent with d:
  Remove g from G.
  Add to G all minimal specializations h of g such that
  1. h is consistent with d, and
  2. some member of S is more specific than h.
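52 Given 3rd Example: why is a candidate specialization h not included in G?
Training example d: ⟨Rainy, Cold, High, Strong, Warm, Change⟩, No
S2, S3: { ⟨Sunny, Warm, ?, Strong, Warm, Same⟩ }
A candidate h may classify d correctly (h(x) = No and c(x) = No), but it must also be more general than some member of S. Is h ≥ s?
VS_{H,D} = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g ≥ h ≥ s)}, where x ≥ y means x is more general than or equal to y.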
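54 Given 4th Example
The 4th example is positive: ⟨Sunny, Warm, High, Strong, Cool, Change⟩, Yes
S3: { ⟨Sunny, Warm, ?, Strong, Warm, Same⟩ }
S4: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
G3: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩ }
G4: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }
The member ⟨?, ?, ?, ?, ?, Same⟩ of G3 does not cover the new positive example and is removed. Check: does some g ∈ G satisfy g ≥ h ≥ s?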
55 Example Trace
[Figure: boundary sets over the training examples d1 to d4. Specific boundary: S0, S1, S2 = S3, S4. General boundary: G0 = G1 = G2, G3, G4.]
56 The hypothesis space after four cases
58 Remarks on Candidate Elimination
– What training example should the learner request next?
– Will the CE algorithm converge to the correct hypothesis?
– How can partially learned concepts be used?
59 What Next Training Example?
60 Who Provides Examples?
Two methods
– Fully supervised learning: an external teacher provides all training examples (input + correct output)
– Learning by query: the learner generates instances (queries) by conducting experiments, then obtains the correct classification for each instance from an external oracle (nature or a teacher).
Negative training examples specialize G; positive ones generalize S.
61 When Does CE Converge?
Will the Candidate-Elimination algorithm converge to the correct hypothesis?
Prerequisites
– 1. No errors in the training examples
– 2. H contains a hypothesis that correctly describes the target concept c(x).
If the S and G boundary sets converge to an empty set, there is no hypothesis in H consistent with the observed examples.
62 Can a partially learned classifier be used?
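63 How to Use Partially Learned Concepts?
Suppose the learner is asked to classify the four new instances shown in the following table. The last column gives the votes of the version space members (positive/negative).
Instance  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport  Votes
A         Sunny  Warm     Normal    Strong  Cool   Change    +           6/0
B         Rainy  Cold     Normal    Light   Warm   Same      −           0/6
C         Sunny  Warm     Normal    Light   Warm   Same      ?           3/3
D         Sunny  Cold     Normal    Strong  Warm   Same      ?           2/4
S: { ⟨Sunny, Warm, ?, Strong, ?, ?⟩ }
Between S and G: ⟨Sunny, ?, ?, Strong, ?, ?⟩, ⟨Sunny, Warm, ?, ?, ?, ?⟩, ⟨?, Warm, ?, Strong, ?, ?⟩
G: { ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩ }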
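The vote counts in the table can be reproduced by letting every member of the version space classify each new instance. The sketch below hard-codes the six hypotheses that are consistent with the four EnjoySport training days; the names `VERSION_SPACE`, `matches`, and `vote` are illustrative assumptions.

```python
VERSION_SPACE = [
    ("Sunny", "Warm", "?", "Strong", "?", "?"),   # S
    ("Sunny", "?", "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?", "?", "?"),
    ("?", "Warm", "?", "Strong", "?", "?"),
    ("Sunny", "?", "?", "?", "?", "?"),           # G
    ("?", "Warm", "?", "?", "?", "?"),            # G
]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def vote(version_space, x):
    """Return (positive votes, negative votes) for instance x."""
    positive = sum(matches(h, x) for h in version_space)
    return positive, len(version_space) - positive

NEW_INSTANCES = {
    "A": ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    "B": ("Rainy", "Cold", "Normal", "Light", "Warm", "Same"),
    "C": ("Sunny", "Warm", "Normal", "Light", "Warm", "Same"),
    "D": ("Sunny", "Cold", "Normal", "Strong", "Warm", "Same"),
}

for name, x in NEW_INSTANCES.items():
    print(name, vote(VERSION_SPACE, x))
```

A unanimous vote (A and B) gives the same answer the fully learned concept would give; a split vote (C and D) means the learner cannot yet commit.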
65 A Biased Hypothesis Space
x1 = +, x2 = +, x3 = −
S2: overly general, it incorrectly covers x3
S3: {}
The third example x3 contradicts the specific boundary S2, which is already overly general. We have biased the learner to consider only conjunctive hypotheses.
66 An Unbiased Learner
Idea: choose H that expresses every teachable concept (i.e., H is the power set of X).
Consider H' = disjunctions, conjunctions, and negations over the previous H, e.g., a disjunction (∨) of two conjunctive hypotheses.
For the target concept to be guaranteed to lie in H, H must be able to express every teachable concept.
What are S and G in this case?
67 Unbiased Learner
Assume positive examples (x1, x2, x3) and negative examples (x4, x5).
S: { (x1 ∨ x2 ∨ x3) }
G: { ¬(x4 ∨ x5) }
How would we classify some new instance x6? For any instance not among the training examples, half of the version space says + and the other half says −.
⇒ To learn the target concept, one would have to present every single instance in X as a training example (rote learning).
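The half-and-half split can be seen on a toy instance space. The sketch below assumes a hypothetical X with seven instances, takes H to be the power set of X, and keeps the hypotheses consistent with three positive and two negative examples. The set names and helper function are illustrative.

```python
from itertools import combinations

X = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]   # assumed toy instance space
positives = {"x1", "x2", "x3"}
negatives = {"x4", "x5"}

def power_set(items):
    """Every subset of items, i.e., every concept an unbiased H can express."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# A hypothesis is consistent if it contains all positives and no negatives.
version_space = [h for h in power_set(X)
                 if positives <= h and not (negatives & h)]

votes_for_x6 = sum("x6" in h for h in version_space)
print(len(version_space), votes_for_x6)   # 4 hypotheses, 2 of them accept x6

S = min(version_space, key=len)   # the disjunction of the positives
G = max(version_space, key=len)   # everything except the negatives
print(sorted(S), sorted(G))
```

For each unseen instance, every consistent hypothesis that includes it has a twin that excludes it, so the vote is always tied.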
68 What Justifies this Inductive Leap? New examples d1:, Yes d2:, No S : { +} Overly general hypothesis space
69 Inductive bias
Our hypothesis space is unable to represent a simple disjunctive target concept: (Sky = Sunny) ∨ (Sky = Cloudy)
x1 = + gives S1, x2 = + gives S2, x3 = − gives S3: {}
The third example x3 contradicts the specific boundary S2, which is already overly general.
71 Why believe we can classify the unseen ? S : { +} Unseen example:, …. Why Inductive learning hypothesis ? Inductive learning hypothesis: “If the hypothesis works for enough data then it will work on new examples.”
72 Inductive bias
The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered (Mitchell, 1980).
Example
– Occam’s Razor
– Target concept c ∈ H (hypothesis space) for the candidate-elimination algorithm
73 Inductive Bias
Consider
– a concept learning algorithm L
– instances X and target concept c
– training examples D_c = {⟨x, c(x)⟩}
– let L(x_i, D_c) denote the classification assigned to the instance x_i by L after training on data D_c.
Definition: the inductive bias of L is any minimal set of assertions B such that, for any target concept c and corresponding training examples D_c,
(∀x_i ∈ X)[(B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c)]
where A ⊢ B means A logically entails B.
74 Inductive bias II
76 Inductive Systems and Equivalent Deductive Systems
[Figure: two systems side by side. Inductive system: the Candidate Elimination Algorithm using hypothesis space H takes training examples and a new instance and outputs a classification of the new instance (or “Don’t Know”). Equivalent deductive system: a theorem prover takes the same training examples and new instance plus the assertion { c ∈ H }, the inductive bias made explicit, and outputs a classification of the new instance (or “Don’t Know”).]
77 Three Learners with Different Biases
Rote Learner
– Weakest bias: anything seen before, i.e., no bias
– Store examples
– Classify x if and only if it matches a previously observed example
Version Space Candidate Elimination Algorithm
– Stronger bias: concepts belonging to conjunctive H
– Store extremal generalizations and specializations
– Classify x if and only if it “falls within” the S and G boundaries (all members agree)
Find-S
– Even stronger bias: most specific hypothesis
– Prior assumption: any instance not observed to be positive is negative
– Classify x based on the S set
78 Summary Points
1. Concept learning as search through H
2. General-to-specific ordering over H
3. Version space candidate elimination algorithm
4. S and G boundaries characterize learner’s uncertainty
5. Learner can generate useful queries
6. Inductive leaps possible only if learner is biased
7. Inductive learners can be modelled by equivalent deductive systems