Concept Learning and the General-to-Specific Ordering

START OF DAY 1 Reading: Chap. 1 & 2

Introduction

How Do We Learn?
- By being told – Content transfer
- By analogy – Context transfer
- By induction – Knowledge construction

Learning by Induction
Induction is a process that "involves intellectual leaps from the particular to the general."
(Orley's Experience)

Group These Objects
Write down your answer – Keep it to yourself

Find the Rule(s)
[picture: objects grouped under the labels UGLY and PRETTY]
Write down your answer – Keep it to yourself

Requirements for Induction
- A language, Lp, to represent the particular (i.e., specific instances)
- A language, Lg, to represent the general (i.e., generalizations)
- A matching predicate, match(G, i), that is true if G is correct about i
- A set, I, of particulars of some unknown general G*

What Does Induction Do?
- Observes the particulars in I
- Produces a generalization, G, such that:
  – G is consistent: for all i ∈ I, match(G, i)
  – G generalizes beyond I: for many i ∉ I, match(G, i)
  – G resembles G*: for most i, G(i) = G*(i)

Two Different Contexts
- Picture 1 is an example of Unsupervised Learning
  – There is no predefined label for the instances
- Picture 2 is an example of Supervised Learning
  – There is a predefined label (or class assignment) for the instances

Unsupervised Learning
- Consistency is (vacuously) guaranteed, since you pick G (= G*) and then assign instances to groups based on G
- Generalization accuracy is also guaranteed, for the same reason
- Need a mechanism to choose among possible groupings (i.e., what makes one grouping better than another?)
  – Internal metrics (e.g., compactness)
  – External metrics (e.g., labeled data – assumes G*)

Supervised Learning
- Must try to achieve consistency
  – Why is that desirable?
  – Is it always possible? No – not with noise (i.e., mislabeled instances), nor with limitations of Lg (i.e., when Lg cannot represent a consistent G)
- Generalization accuracy can be measured directly
- Need a mechanism to choose among possible generalizations (i.e., what makes one generalization "better" than another? Remember, we do not know G*)

Revisiting Our Examples
- Picture 1
  – What groupings did you come up with?
  – How good are your groupings?
- Picture 2
  – What generalizations did you come up with?
  – Are they consistent?
  – How well do you think your generalizations will do beyond the observed instances?

Zooming in on Picture 2
- There are obviously several generalizations that are consistent:
  – If red or green, then class 1; otherwise class 2
  – If fewer than 3 edges, then class 1; otherwise class 2
- So: why did you choose the one you did?
- Suppose that I now tell you that the PRETTY class is the set of complex polygons (i.e., with 4 sides or more). Would that change your preference? Why?
BIAS

The Need for and Role of Bias

Recall: Concept Learning
Given:
- A language of observations/instances
- A language of concepts/generalizations
- A matching predicate
- A set of observations
Find generalizations that:
1. Are consistent with the observations, and
2. Classify instances beyond those observed

Working Example (I)
- Observations are characterized by a fixed set of features (or attributes); each instance corresponds to a specific assignment of values to the features
- For example:
  – Features: color [red, blue, green, yellow], shape [square, triangle, circle], size [large, medium, small]
  – Observations: O1, O2, O3, etc.
- Language of instances = set of attribute-value pairs: the Attribute-Value Language (AVL)

Working Example (II)
- Generalizations are used to represent N ≥ 1 instances
- For example:
  – G1 represents all red objects (independent of shape and size)
  – G2 represents small blue objects that are either squares or triangles
- Each generalization can be viewed as a set, namely the set of instances it represents (or matches)
- In the above example:
  – G1 = {(red, square, large), (red, square, medium), …, (red, circle, small)}
  – G2 = {(blue, square, small), (blue, triangle, small)}
- Language of generalizations = designer chooses
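
To make the working example concrete, here is a minimal Python sketch; the tuple encoding and the variable names are illustrative assumptions, not anything prescribed by the slides:

```python
from itertools import product

# Assumed encoding: an instance is a (color, shape, size) tuple.
FEATURES = {
    "color": ["red", "blue", "green", "yellow"],
    "shape": ["square", "triangle", "circle"],
    "size":  ["large", "medium", "small"],
}
INSTANCES = list(product(*FEATURES.values()))  # all 4*3*3 = 36 instances

def match(g, i):
    """Matching predicate: '*' in a generalization matches any value."""
    return all(gv == "*" or gv == iv for gv, iv in zip(g, i))

G1 = ("red", "*", "*")  # all red objects, in the AVL-plus-'*' language
ext = [i for i in INSTANCES if match(G1, i)]
print(len(ext), ext[:2])  # 9 red instances, e.g. ('red', 'square', 'large')
```

Note that G2 above has no single expression in AVL U {*}: covering both of its instances requires a disjunction over shapes, which foreshadows the discussion of language bias below.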

Our First Learning Algorithm
Definitions:
- T = set of training instances
- S = set of maximally specific generalizations consistent with T
- G = set of maximally general generalizations consistent with T
Version Space Algorithm:
- S keeps generalizing to accommodate new positive instances
- G keeps specializing to avoid new negative instances
- Both move only to the smallest extent necessary to maintain consistency with T: G remains as general as possible, and S remains as specific as possible

Version Space Learning
Initialize G to the most general concept in the space
Initialize S to the first positive training instance
For each new positive training instance p:
- Delete all members of G that do not cover p
- For each s in S: if s does not cover p, replace s with its most specific generalizations that cover p
- Remove from S any element more general than some other element in S
- Remove from S any element not more specific than some element in G
For each new negative training instance n:
- Delete all members of S that cover n
- For each g in G: if g covers n, replace g with its most general specializations that do not cover n
- Remove from G any element more specific than some other element in G
- Remove from G any element more specific than some element in S
If G = S and both are singletons, a single concept consistent with the training data has been found
If G and S become empty, there is no concept consistent with the training data
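
As an illustration, the following is a minimal, runnable Python sketch of this algorithm, simplified to single-object instances with conjunctive AVL U {*} hypotheses (the two-object diagrams of the next example would need a richer matcher). The feature names, the tiny dataset, and the assumption that the first training example is positive are all choices made for the sketch:

```python
# Candidate elimination for a single-object, conjunctive AVL U {*} language.
# A hypothesis is a (size, color, shape) tuple whose entries are values or '*'.
FEATURES = {
    "size":  ["large", "small"],
    "color": ["red", "blue"],
    "shape": ["triangle", "circle"],
}
ATTRS = list(FEATURES)

def covers(h, x):
    """h covers x if every attribute of h equals x's value or is '*'."""
    return all(hv == "*" or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 is at least as general as h2 in the partial ordering."""
    return all(a == "*" or a == b for a, b in zip(h1, h2))

def min_generalizations(s, x):
    """Minimally generalize s to cover x (unique in a conjunctive language)."""
    return [tuple(sv if sv == xv else "*" for sv, xv in zip(s, x))]

def min_specializations(g, x):
    """Minimally specialize g so that it no longer covers x."""
    specs = []
    for i, attr in enumerate(ATTRS):
        if g[i] == "*":
            specs += [g[:i] + (v,) + g[i + 1:]
                      for v in FEATURES[attr] if v != x[i]]
    return specs

def candidate_elimination(examples):
    """examples: list of (instance, is_positive); first example must be positive."""
    G = [("*",) * len(ATTRS)]
    S = [examples[0][0]]
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]          # drop inconsistent g
            S = [s2 for s in S                          # minimally generalize s
                 for s2 in ([s] if covers(s, x) else min_generalizations(s, x))]
            S = [s for s in S                           # keep S below G
                 if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]      # drop inconsistent s
            new_G = []
            for g in G:
                if covers(g, x):                        # minimally specialize g
                    new_G += [gs for gs in min_specializations(g, x)
                              if any(more_general_or_equal(gs, s) for s in S)]
                else:
                    new_G.append(g)
            G = [g for g in new_G                       # keep only maximal g
                 if not any(g2 != g and more_general_or_equal(g2, g)
                            for g2 in new_G)]
    return S, G

data = [
    (("large", "red", "triangle"), True),
    (("large", "blue", "circle"), False),
    (("small", "red", "triangle"), True),
]
S, G = candidate_elimination(data)
print("S:", S)  # [('*', 'red', 'triangle')]
print("G:", G)  # [('*', 'red', '*'), ('*', '*', 'triangle')]
```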

Visualizing the Version Space
[diagram: nested regions of the hypothesis space – the boundary of G encloses the boundary of S, and the version space is everything in between]

VS Example (I)
- Task: two-object diagram classification
- Instance Language: AVL: Size x Color x Shape
- Generalization Language: AVL U {*}
- Training Instances:
  – P1: {(Large Red Triangle) (Small Blue Circle)}
  – P2: {(Large Blue Circle) (Small Red Triangle)}
  – N1: {(Large Blue Triangle) (Small Blue Triangle)}

VS Example (II)
Initialization:
- G0 = [{(* * *) (* * *)}] – most general
- S0 = ∅ – most specific

VS Example (III) (L R T) (L R T) (S R T)(S R T) (L R T) (L R C)(L R T) (S R C)(S R T) (S R C) (L R T) (L B T)(L R T) (S B T)(S R T) (S B T) (L R T) (L B C)(L R T) (S B C)(S R T) (S B C) (L R C) (S R T) (L R C) (L R C) (S R C)(S R C) (L R C) (L B T)(L R C) (S B T)(S R C) (S B T) (L R C) (L B C)(L R C) (S B C)(S R C) (S B C) (L B T) (S R T) (L B T) (S R C) (L B T) (L B T) (S B T)(S B T) (L B T) (L B C)(L B T) (S B C)(S B T) (S B C) (L B C) (S R T) (L B C) (S R C) (L B C) (S B T) (L B C) (L B C) (S B C)(S B C)

VS Example (IV)
After P1:
- G1 = [{(* * *) (* * *)}] – no change
- S1 = [{(L R T) (S B C)}] – minimum generalization to cover P1

VS Example (V) (L R T) (L R T) (S R T)(S R T) (L R T) (L R C)(L R T) (S R C)(S R T) (S R C) (L R T) (L B T)(L R T) (S B T)(S R T) (S B T) (L R T) (L B C)(L R T) (S B C)(S R T) (S B C) (L R C) (S R T) (L R C) (L R C) (S R C)(S R C) (L R C) (L B T)(L R C) (S B T)(S R C) (S B T) (L R C) (L B C)(L R C) (S B C)(S R C) (S B C) (L B T) (S R T) (L B T) (S R C) (L B T) (L B T) (S B T)(S B T) (L B T) (L B C)(L B T) (S B C)(S B T) (S B C) (L B C) (S R T) (L B C) (S R C) (L B C) (S B T) (L B C) (L B C) (S B C)(S B C)

VS Example (VI)
After P2:
- G2 = [{(* * *) (* * *)}] – no change
- S2 = [{(L * *) (S * *)} {(* R T) (* B C)}] – minimum generalization to cover P2
S extends its boundary to cover the new positive instance, but only as far as needed – no zeal!

VS Example (VII) (L R T) (L R T) (S R T)(S R T) (L R T) (L R C)(L R T) (S R C)(S R T) (S R C) (L R T) (L B T)(L R T) (S B T)(S R T) (S B T) (L R T) (L B C)(L R T) (S B C)(S R T) (S B C) (L R C) (S R T) (L R C) (L R C) (S R C)(S R C) (L R C) (L B T)(L R C) (S B T)(S R C) (S B T) (L R C) (L B C)(L R C) (S B C)(S R C) (S B C) (L B T) (S R T) (L B T) (S R C) (L B T) (L B T) (S B T)(S B T) (L B T) (L B C)(L B T) (S B C)(S B T) (S B C) (L B C) (S R T) (L B C) (S R C) (L B C) (S B T) (L B C) (L B C) (S B C)(S B C)

VS Example (VIII)
After N1:
- G3 = [{(* R *) (* * *)} {(* * C) (* * *)}] – minimum specialization to exclude N1
- S3 = [{(* R T) (* B C)}] – remove inconsistent generalization
G contracts its boundary to exclude the new negative instance, but only as far as needed – no zeal!

VS Example (IX) (L R T) (L R T) (S R T)(S R T) (L R T) (L R C)(L R T) (S R C)(S R T) (S R C) (L R T) (L B T)(L R T) (S B T)(S R T) (S B T) (L R T) (L B C)(L R T) (S B C)(S R T) (S B C) (L R C) (S R T) (L R C) (L R C) (S R C)(S R C) (L R C) (L B T)(L R C) (S B T)(S R C) (S B T) (L R C) (L B C)(L R C) (S B C)(S R C) (S B C) (L B T) (S R T) (L B T) (S R C) (L B T) (L B T) (S B T)(S B T) (L B T) (L B C)(L B T) (S B C)(S B T) (S B C) (L B C) (S R T) (L B C) (S R C) (L B C) (S B T) (L B C) (L B C) (S B C)(S B C)

Predicting with the Version Space
- A new instance is classified as positive if and only if it is covered by every generalization in the version space
- A new instance is classified as negative if and only if no generalization in the version space covers it
- If some, but not all, of the generalizations in the version space cover the new instance, then the instance cannot be classified with certainty (an estimated classification could be given, based on the proportion of generalizations within the version space that cover the new instance)
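
A minimal sketch of this decision rule, using the single-object version space produced by the earlier candidate-elimination sketch (names again assumed for illustration):

```python
def covers(h, x):
    """h covers x if every attribute of h equals x's value or is '*'."""
    return all(hv == "*" or hv == xv for hv, xv in zip(h, x))

def classify(version_space, x):
    votes = [covers(h, x) for h in version_space]
    if all(votes):
        return "+"
    if not any(votes):
        return "-"
    # Split vote: report the fraction of hypotheses covering x as an estimate.
    return f"? ({sum(votes)}/{len(votes)} cover it)"

# The three hypotheses between S = (*, red, triangle) and
# G = {(*, red, *), (*, *, triangle)} from the earlier run.
VS = [("*", "red", "triangle"), ("*", "red", "*"), ("*", "*", "triangle")]
print(classify(VS, ("small", "red", "triangle")))  # + : covered by all
print(classify(VS, ("large", "blue", "circle")))   # - : covered by none
print(classify(VS, ("large", "red", "circle")))    # ? : covered by some
```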

VS Example (X)
The Version Space is:
- G3 = [{(* R *) (* * *)} {(* * C) (* * *)}]
- S3 = [{(* R T) (* B C)}]
Predicting:
- {(Small Red Triangle) (Small Blue Circle)}: +
- {(Small Blue Triangle) (Large Red Circle)}: -
- {(Large Red Triangle) (Large Blue Triangle)}: ?

Taking Stock
Does VS solve the Concept Learning problem?
- It produces generalizations
- The generalizations are consistent with T
- The generalizations extend beyond T
Why/how does it work? Let's make some assumptions and replay.

Unbiased Generalization Language
- A language such that every possible subset of instances can be represented
- AVL U {*} is not unbiased: every replacement by * forces the representation of ALL of the values of the corresponding attribute
- AVL U {∨} is better, but still not unbiased: it cannot represent, e.g., a two-instance set such as {(red, square, large), (blue, circle, small)}, whose members share no attribute value
- In an Unbiased Generalization Language (UGL), all subsets must have a representation, i.e., UGL = power set of the given instance language
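
The gap is easy to quantify; a short sketch, reusing the assumed feature names from earlier:

```python
from itertools import product

FEATURES = {
    "color": ["red", "blue", "green", "yellow"],
    "shape": ["square", "triangle", "circle"],
    "size":  ["large", "medium", "small"],
}
num_instances = 1
for values in FEATURES.values():
    num_instances *= len(values)               # 4 * 3 * 3 = 36 instances

# AVL U {*}: each attribute is either a specific value or '*'.
avl_star = list(product(*[v + ["*"] for v in FEATURES.values()]))
print(len(avl_star))       # 5 * 4 * 4 = 80 expressible generalizations
print(2 ** num_instances)  # 2^36 = 68,719,476,736 subsets in the power set
```

Even with '*', only 80 of the 2^36 possible subsets of the instance space are expressible – and that restriction is exactly the bias that makes induction possible.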

Unbiased Generalization Procedure
- Uses an Unbiased Generalization Language (UGL)
- Computes the Version Space (VS) relative to UGL
- VS = the set of all expressible generalizations consistent with the training instances

Claim
VS with UGL cannot solve part 2 of the Concept Learning problem (classifying instances beyond those observed), i.e., learning is limited to rote learning

Lemma 1
Any new instance, NI, is classified as positive if and only if NI is identical to some observed positive instance.
Proof:
(⇐) If NI is identical to some observed positive instance, then NI is classified as positive
- Follows directly from the definition of VS
(⇒) If NI is classified as positive, then NI is identical to some observed positive instance
- Let g = {p : p is an observed positive instance}. Since the language is UGL, g is expressible and consistent, so g ∈ VS. NI is classified as positive, so NI matches every member of VS, in particular g; hence NI is an observed positive instance.

Lemma 2
Any new instance, NI, is classified as negative if and only if NI is identical to some observed negative instance.
Proof:
(⇐) If NI is identical to some observed negative instance, then NI is classified as negative
- Follows directly from the definition of VS
(⇒) If NI is classified as negative, then NI is identical to some observed negative instance
- Let G = {all subsets containing observed negative instances}. With UGL, VS = UGL − G. If NI matches no member of VS, NI must be an observed negative: otherwise the set consisting of the observed positive instances plus NI would belong to VS and would match NI.

Lemma 3
If NI is any instance that was not observed, then NI matches exactly one half of VS, and so cannot be classified.
Proof:
- Let g = {p : p is an observed positive instance}
- Let G' = {all subsets of unobserved instances}
- With UGL, VS = {g ∪ g' : g' ∈ G'}. Since NI was not observed, NI belongs to exactly half of the subsets in G', so NI matches exactly half of VS.

Theorem
It follows directly from Lemmas 1–3 that an unbiased generalization procedure can never make the inductive leap necessary to classify instances beyond those it has observed.

Another Way to Look at It…
There are 2^(2^n) Boolean functions of n inputs.
[table: truth table over x1, x2, x3 listing the observed class labels and the function hypotheses consistent with them; the row 1 1 1 is left unobserved]
What do we predict for 1 1 1? There are as many consistent functions predicting 0 as there are consistent functions predicting 1!
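
This claim is easy to check by brute force; in the sketch below, the three observed rows and their labels are made up for illustration:

```python
from itertools import product

n = 3
rows = list(product([0, 1], repeat=n))                 # the 2^n = 8 input rows
observed = {(0, 0, 0): 0, (0, 1, 1): 1, (1, 0, 1): 0}  # assumed training labels
target = (1, 1, 1)                                     # the unobserved query row

votes = {0: 0, 1: 0}
# Enumerate all 2^(2^n) = 256 Boolean functions as truth-table bit vectors.
for bits in product([0, 1], repeat=len(rows)):
    f = dict(zip(rows, bits))
    if all(f[x] == y for x, y in observed.items()):
        votes[f[target]] += 1
print(votes)  # {0: 16, 1: 16} – the consistent functions split evenly
```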

Yet Another Way to Look at It…
- If there is no bias, the outcome of the learner is highly dependent on the training data, and thus there is much variance among the models induced from different sets of observations
  – Learner memorizes (overfits)
- If there is a strong bias, the outcome of the learner is much less dependent on the training data, and thus there is little variance among induced models
  – Learner ignores observations
Formalized as the bias-variance decomposition of error.
(Semmelweis' Experience)

What is Bias?
BIAS: any basis for choosing one decision over another, other than strict consistency with past observations

Going Back to Our Question
Our example worked BECAUSE the generalization language (AVL U {*}) was biased!
In fact, we have just shown that if a learning system is to be useful, it must have some form of bias.

Humans as Learning Systems
Do we have biases? – All kinds!!!
- Our representation language cannot express all possible classes of observations
- Our generalization procedure is biased:
  – Domain knowledge (e.g., double bonds rarely break)
  – Intended use (e.g., ICU – relative cost)
  – Shared assumptions (e.g., crown, bridge – dentistry)
  – Simplicity and generality (e.g., white men can't jump)
  – Analogy (e.g., heat vs. water flow, thin ice)
  – Commonsense (e.g., social interactions, pain, etc.)
(Survey Exercise)

Our First Lesson
- The power of a generalization system follows directly from its biases
- Absence of bias = rote learning
- Progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases
- We will consider these issues for every algorithm we study

Are There Better Biases?

No Free Lunch Theorem
- A.k.a. the Law of Conservation for Generalization Performance (LCG)
- Generalization performance: GP = Accuracy − 50%
- When taken across all learning tasks, the generalization performance of any learner sums to 0

NFL Intuition (I)

NFL Intuition (II)

NFL Intuition (III)

Second Lesson
Whenever a learning algorithm performs well on some function, as measured by off-training-set (OTS) generalization, it must perform poorly on some other(s).
In other words, there is no universal learner, and no best bias!

Impact on Users

Towards a Solution

Taking Stock
We will study a number of learning algorithms. We promise to:
- Discuss their language and procedural biases
- Always remember NFL

END OF DAY 1 Homework: Thought Questions