Learning Learning is the holy grail of AI. If we can build systems that learn, then we can begin with minimal information and high-level strategies and have the systems better themselves. This avoids the “knowledge engineering bottleneck,” where everything must be hand-coded. Effective learning, however, is very difficult.

Goal Learning is any change in a system that allows it to perform better the second time on repetition of the same task, or on another task drawn from the same population (Herbert Simon, 1983).

Machine Learning Symbol-based approach: A set of symbols represents the entities and relationships of a problem domain; the goal is to infer useful generalizations of concepts. Connectionist approach: Knowledge is represented by patterns in a network of small, simple processing units; the goal is to recognize invariant patterns in data and represent them in the structure of the network.

Machine Learning (cont'd) Genetic algorithms: A population of candidate solutions mutate, combine with one another, and are selected according to a fitness measure. Stochastic methods: Conclusions are based on both the knower's expectations and the data (Bayes' rule); often implemented using Markov processes.

Types of Learning Supervised learning: Training examples, both positive and negative, are classified by a teacher for use by the learning algorithm. Unsupervised learning: The training data are not classified by a teacher; category formation, or conceptual clustering, is an example. Reinforcement learning: The agent receives feedback from the environment as it acts.

Categorization: Symbol-based What is the data? What are the goals? How is knowledge represented? What is the concept space? What operations may be performed on concepts? How is the concept space searched (what heuristics are used)?

Example – Arch recognition Problem: how to learn the concept of an 'arch' from configurations of building blocks (Winston). Symbolist Supervised learning Both positive and negative examples (the negative examples are "near misses") Knowledge is represented by semantic networks Operations are graph modification and node generalization Search is data-driven

Example (cont'd) part(arch, x), part(arch, y), part(arch, z) type(x, brick), type(y, brick), type(z, brick) supports(x, z), supports(y,z)

Example (cont'd) part(arch, x), part(arch, y), part(arch, z) type(x, brick), type(y, brick), type(z, pyramid) supports(x, z), supports(y,z)

Example (cont'd) Background knowledge: isa(brick, polygon), isa(pyramid, polygon) Generalization: part(arch, x), part(arch, y), part(arch, z) type(x, brick), type(y, brick), type(z, polygon) supports(x, z), supports(y,z)

Negative Example: Near Miss part(arch, x), part(arch, y), part(arch, z) type(x, brick), type(y, brick), type(z, brick) supports(x, z), supports(y,z) touches(x,y), touches(y,x)

Generalization part(arch, x), part(arch, y), part(arch, z) type(x, brick), type(y, brick), type(z, brick) supports(x, z), supports(y,z) ~touches(x,y), ~touches(y,x)
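A minimal Python sketch of this style of learning, assuming a hand-rolled representation in which a concept is a set of predicate tuples: a positive example whose part differs in type triggers generalization up the isa hierarchy, and a near miss adds a negated constraint. The function names and encoding are illustrative, not Winston's actual implementation.

# Sketch: a concept is a set of predicate tuples, e.g. ("type", "z", "brick").
# The isa hierarchy below is the background knowledge from the slide above.

ISA = {"brick": "polygon", "pyramid": "polygon"}

def generalize_with_positive(concept, example):
    """Lift differing 'type' facts to a common ancestor in the isa hierarchy."""
    new_concept = set()
    for fact in concept:
        if fact[0] == "type" and fact not in example:
            # Find the fact about the same object in the positive example.
            other = next(f for f in example if f[:2] == fact[:2])
            if ISA.get(fact[2]) == ISA.get(other[2]):
                new_concept.add(("type", fact[1], ISA[fact[2]]))
                continue
        new_concept.add(fact)
    return new_concept

def specialize_with_near_miss(concept, near_miss):
    """Forbid any relation the near miss has but the concept does not."""
    extra = {f for f in near_miss if f not in concept}
    return concept | {("not",) + f for f in extra}

arch = {("part", "arch", "x"), ("part", "arch", "y"), ("part", "arch", "z"),
        ("type", "x", "brick"), ("type", "y", "brick"), ("type", "z", "brick"),
        ("supports", "x", "z"), ("supports", "y", "z")}

positive = (arch - {("type", "z", "brick")}) | {("type", "z", "pyramid")}
near_miss = arch | {("touches", "x", "y"), ("touches", "y", "x")}

arch = generalize_with_positive(arch, positive)      # adds type(z, polygon)
arch = specialize_with_near_miss(arch, near_miss)    # adds not touches(x, y), not touches(y, x)
print(sorted(arch))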

Version Space Search (Mitchell) The problem is to find a general concept (or set of concepts) that includes the positive examples and excludes the negative ones. Symbolist Supervised learning Both positive and negative examples Predicate calculus Generalization operations Search is data driven

Generalization Operators Replace a constant with a variable: color(ball, red) -> color(X, red) Drop a conjunct: shape(X, round) ^ size(X, small) ^ color(X, red) -> shape(X, round) ^ color(X, red) Add a disjunct: shape(X, round) ^ color(X, red) -> shape(X, round) ^ (color(X, red) v color(X, blue)) Replace a property by a more general property: color(X, red) -> color(X, primary_color)

More General Concept Concept p is more general than concept q (or p covers q) if the set of elements that satisfy p is a superset of the set of elements that satisfy q. If p(x) and q(x) are descriptions that classify objects as positive examples, then p(x) -> positive(x) |= q(x) -> positive(x).
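A quick sketch of the covers relation for the simple attribute-vector hypotheses used in the examples below, where "?" is a variable that matches anything (the representation is an assumption of this sketch, not the only possible one):

def covers(hypothesis, example):
    """A hypothesis covers an example if every attribute matches or is the wildcard '?'."""
    return all(h == "?" or h == e for h, e in zip(hypothesis, example))

def more_general(p, q):
    """p is at least as general as q if p matches everything q matches, attribute by attribute."""
    return all(hp == "?" or hp == hq for hp, hq in zip(p, q))

# Example: objects described by (size, color, shape).
print(covers(("small", "?", "ball"), ("small", "red", "ball")))   # True
print(more_general(("?", "?", "ball"), ("small", "red", "ball"))) # True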

Version Space The version space is the set of all concept descriptions that are consistent with the training examples. Mitchell describes three algorithms for finding the version space: specific-to-general search, general-to-specific search, and the candidate elimination algorithm, which searches in both directions.

Specific to General Search

    S = {first positive training instance};
    N = {};   // set of all negative instances seen so far
    for each positive instance p {
        for every s ∊ S, if s doesn't match p,
            replace s in S with its most specific generalization that matches p;
        Delete from S all hypotheses more general than others in S;
        Delete from S all hypotheses that match any n ∊ N;
    }
    for every negative instance n {
        Delete all hypotheses from S that match n;
        N = N ∪ {n};
    }
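As a concrete illustration, here is a small runnable Python sketch of specific-to-general search over the attribute-vector hypotheses defined above (the helper names and the three-attribute domain are assumptions of this sketch):

def minimal_generalization(s, example):
    """Most specific generalization of s that covers the example: keep agreeing
    attributes, replace disagreeing ones with the wildcard '?'."""
    return tuple(a if a == b else "?" for a, b in zip(s, example))

def specific_to_general(examples):
    """examples is a list of (instance, label) pairs with label True/False."""
    positives = [x for x, lbl in examples if lbl]
    S = {positives[0]}                 # first positive instance, maximally specific
    N = set()                          # negative instances seen so far
    for x, lbl in examples:
        if lbl:
            S = {minimal_generalization(s, x) if not covers(s, x) else s for s in S}
            S = {s for s in S if not any(covers(s, n) for n in N)}
        else:
            S = {s for s in S if not covers(s, x)}
            N.add(x)
    return S

training = [(("small", "red", "ball"), True),
            (("small", "white", "ball"), True),
            (("large", "blue", "cube"), False)]
print(specific_to_general(training))   # {('small', '?', 'ball')}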

General to Specific Search

    G = {most general concept in the concept space};
    P = {};   // set of all positive instances seen so far
    for each negative instance n {
        for every g ∊ G, if g matches n,
            replace g in G with its most general specialization that doesn't match n;
        Delete from G all hypotheses more specific than others in G;
        Delete from G all hypotheses that fail to match some p ∊ P;
    }
    for every positive instance p {
        Delete all hypotheses from G that fail to match p;
        P = P ∪ {p};
    }
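A matching sketch of the G-boundary update: for the attribute-vector representation, the most general specializations of g that exclude a negative instance are obtained by filling one wildcard at a time with a value the negative instance does not have (the value lists below are assumptions of this sketch):

# Possible values per attribute position, assumed for this toy domain.
DOMAIN = [("small", "large"), ("red", "white", "blue"), ("ball", "brick", "cube")]

def minimal_specializations(g, negative):
    """All most general specializations of g that fail to cover the negative instance."""
    result = []
    for i, value in enumerate(g):
        if value == "?":
            for v in DOMAIN[i]:
                if v != negative[i]:
                    result.append(g[:i] + (v,) + g[i + 1:])
    return result

print(minimal_specializations(("?", "?", "?"), ("large", "blue", "cube")))
# [('small', '?', '?'), ('?', 'red', '?'), ('?', 'white', '?'), ('?', '?', 'ball'), ('?', '?', 'brick')]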

Candidate Elimination Algorithm

    G = {most general concept in the concept space};
    S = {first positive training instance};
    for each new positive instance p {
        Delete from G all hypotheses that fail to match p;
        for every s ∊ S, if s doesn't match p,
            replace s in S with its most specific generalization that matches p;
        Delete from S all hypotheses more general than others in S;
        Delete from S all hypotheses more general than some hypothesis in G;
    }

CEA (cont'd)

    for each new negative instance n {
        Delete from S all hypotheses that match n;
        for every g ∊ G, if g matches n,
            replace g in G with its most general specialization that doesn't match n;
        Delete from G all hypotheses more specific than others in G;
        Delete from G all hypotheses more specific than some hypothesis in S;
    }

If G == S and both are singletons, the algorithm has found a single concept that is consistent with the data, and it halts. If G and S become empty, there is no concept that covers all the positive instances and excludes all the negative ones.
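Putting the pieces together, here is a compact runnable sketch of candidate elimination for the toy attribute-vector domain, reusing the helpers from the preceding sketches (an illustrative reconstruction, not Mitchell's original code):

def candidate_elimination(examples):
    G = {("?",) * len(DOMAIN)}                    # most general concept
    S = {next(x for x, lbl in examples if lbl)}   # first positive instance
    for x, lbl in examples:
        if lbl:
            G = {g for g in G if covers(g, x)}
            S = {minimal_generalization(s, x) if not covers(s, x) else s for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not covers(s, x)}
            G = {g2 for g in G
                     for g2 in (minimal_specializations(g, x) if covers(g, x) else [g])}
            G = {g for g in G if any(more_general(g, s) for s in S)}
    return G, S

training = [(("small", "red", "ball"), True),
            (("small", "white", "ball"), True),
            (("large", "blue", "cube"), False)]
print(candidate_elimination(training))
# e.g. G = {('small', '?', '?'), ('?', '?', 'ball')}, S = {('small', '?', 'ball')}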

Candidate Elimination Algorithm The concepts in G cover a superset of the instances covered by the concepts in S, and the concepts that lie between the two boundaries are consistent with the training data. The algorithm is incremental: it can process one training example at a time and maintain a usable, though incomplete, generalization. It is sensitive to noise and inconsistency in the training data. It is essentially a breadth-first search; heuristics can be used to trim the search space.

LEX: Integrating Algebraic Exprs. LEX (Mitchell et al.) integrates algebraic expressions by starting with an initial expression and then searching the space of equivalent expressions until it finds one with no integral signs. The system induces heuristics that improve its performance, based on data obtained from its own problem solver.

LEX (cont'd) The operators are rules of expression transformation:

    OP1: ∫ r f(x) dx -> r ∫ f(x) dx
    OP2: ∫ u dv -> uv - ∫ v du
    OP3: 1 * f(x) -> f(x)
    OP4: ∫ (f1(x) + f2(x)) dx -> ∫ f1(x) dx + ∫ f2(x) dx
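To make the rewrite-rule idea concrete, here is a tiny sketch (not LEX's actual representation) that encodes expressions as nested tuples and applies OP1, factoring a constant out of an integral:

# Expressions as nested tuples, e.g. ('int', ('*', ('const', 3), ('cos', 'x')), 'x')
# stands for the integral of 3*cos(x) dx. Purely illustrative encoding.

def apply_op1(expr):
    """OP1: ∫ r·f(x) dx -> r·∫ f(x) dx, when the integrand is a product
    whose left factor is a constant."""
    if expr[0] == "int" and expr[1][0] == "*" and expr[1][1][0] == "const":
        r, f, var = expr[1][1], expr[1][2], expr[2]
        return ("*", r, ("int", f, var))
    return expr   # rule does not apply; return the expression unchanged

e = ("int", ("*", ("const", 3), ("cos", "x")), "x")
print(apply_op1(e))   # ('*', ('const', 3), ('int', ('cos', 'x'), 'x'))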

Heuristics Heuristics are of the form: if the current problem state matches pattern P, then apply operator O with bindings B. Example: if a problem state matches ∫ x transcendental(x) dx, then apply OP2 with bindings u = x, dv = transcendental(x) dx

Symbol Hierarchy LEX's description language includes a hierarchy of symbols and types: cos is a trig function, trig functions are transcendental functions, and so on.

LEX Architecture LEX consists of four components: A generalizer that uses the Candidate Elimination Algorithm to find heuristics, A problem solver that produces traces of problem solutions, A critic that produces positive and negative instances from the problem trace, and A problem generator that produces new candidate problems.

How it works LEX maintains a version space for each operator. Each version space represents the partially learned heuristic for that operator. The version space is updated from the positive and negative examples generated by the critic. The problem solver builds a tree of the space it searches in solving an integration problem, doing best-first search guided by the partial heuristics.

How it works (cont'd) Deciding whether an example is positive or negative is an instance of the credit assignment problem. After solving a problem, LEX finds the shortest path from the initial expression to the solution. The operator applications on that shortest path are classified as positive examples, and those that are not are classified as negative. Since the search is not admissible, the path found may not actually be the shortest one.
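A small sketch of this labeling step, assuming the solver returns its search tree as a child-to-parent map plus the operator used on each edge (the data-structure names are invented for illustration):

def label_operator_applications(parents, start, goal):
    """parents maps each generated state to (parent_state, operator_used), i.e.
    applying operator_used to parent_state produced that state. Applications on
    the start-to-goal path become positive examples; the rest become negative."""
    path_edges = set()
    state = goal
    while state != start:                  # walk back up the solution path
        parent, op = parents[state]
        path_edges.add((parent, op, state))
        state = parent
    positives, negatives = [], []
    for child, (parent, op) in parents.items():
        (positives if (parent, op, child) in path_edges else negatives).append((parent, op))
    return positives, negatives

# Tiny example: start "a", goal "d"; b and c were produced from a, d from b.
parents = {"b": ("a", "OP1"), "c": ("a", "OP2"), "d": ("b", "OP1")}
print(label_operator_applications(parents, "a", "d"))
# ([('a', 'OP1'), ('b', 'OP1')], [('a', 'OP2')])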

ID3 Decision Tree Algorithm A different approach to machine learning is to construct decision trees. At each internal node we test one property of the object and follow the branch for its value, until we reach a leaf, at which point we can classify the object. We try to construct the best decision tree, the one with the fewest nodes (decisions). Here there may be many categories, not just positive and negative.
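For instance, a decision tree might be represented as nested tuples and applied to an instance like this (a representation chosen for this sketch, not prescribed by ID3; the weather attributes are just a familiar illustration):

# A node is ("property", {value: subtree, ...}); a leaf is just a category label.
tree = ("outlook", {"sunny": ("humidity", {"high": "no", "normal": "yes"}),
                    "overcast": "yes",
                    "rain": ("wind", {"strong": "no", "weak": "yes"})})

def classify(node, instance):
    """Walk from the root, testing one property per node, until a leaf is reached."""
    while isinstance(node, tuple):
        prop, branches = node
        node = branches[instance[prop]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # yes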

ID3 Problem: classify a set of instances based on the values of their properties. Symbolist Supervised learning Each instance is assigned to one of a finite set of categories Knowledge is represented as a decision tree, and the operations build the tree All training instances must be known in advance (the algorithm is not incremental)

Simple Tree Formation Choose a property. The property divides the set of examples into subsets according to their values of that property. Recursively create a sub-tree for each subset. Make all the sub-trees children of a root node that tests the chosen property.
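A bare-bones recursive sketch of this procedure, with property selection deferred to a choose_property function (supplied later, when the information-gain criterion is defined); the helper names are assumptions of this sketch:

def build_tree(examples, properties, choose_property):
    """examples: list of (instance_dict, category); properties: list of property names."""
    categories = {cat for _, cat in examples}
    if len(categories) == 1:                 # all examples agree: make a leaf
        return categories.pop()
    if not properties:                       # no tests left: return the majority category
        return max(categories, key=lambda c: sum(cat == c for _, cat in examples))
    prop = choose_property(examples, properties)
    branches = {}
    for value in {inst[prop] for inst, _ in examples}:
        subset = [(inst, cat) for inst, cat in examples if inst[prop] == value]
        remaining = [p for p in properties if p != prop]
        branches[value] = build_tree(subset, remaining, choose_property)
    return (prop, branches)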

Caveat The tree that is formed depends heavily on the order in which the properties are chosen. The idea is to choose the most informative property first and use it to subdivide the set of examples; this tends to lead to the best (smallest) tree.

Information Theory The amount of information carried by a message (Shannon) is a function of its probability of occurrence p, namely -log2(p). Given a universe of messages M = {m_1, m_2, ..., m_n} and a probability p(m_i) for the occurrence of each message, the expected information content of a message in M is:

    I[M] = ∑_{i=1..n} -p(m_i) log2(p(m_i)) = E[-log2 p(m_i)]
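Translated into code, the expected information (entropy) of a set of classified examples might be computed like this (a straightforward sketch, with category frequencies standing in for the message probabilities):

import math
from collections import Counter

def information(categories):
    """Expected information content, in bits, of a list of category labels."""
    counts = Counter(categories)
    total = len(categories)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(information(["yes"] * 9 + ["no"] * 5))   # about 0.940 bits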

Choosing the Property The information gain provided by choosing property A at the root of the tree is the total information of the tree minus the amount of information needed to complete the classification after splitting on A. The amount of information needed to complete the tree is defined as the weighted average of the information in its subtrees.

Choosing the Property (cont'd) Assuming a set of training instances C, if we make property P with n values the root of the tree, then C will be partitioned into subsets {C_1, C_2, ..., C_n}. The expected information needed to complete the tree after the split is:

    E[P] = ∑_{i=1..n} (|C_i| / |C|) * I[C_i]

and the information gain from choosing property P is:

    gain(P) = I[C] - E[P]
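These two formulas complete the sketch started above: a gain function and a choose_property that picks the property with the largest gain, which can be passed to build_tree (names carried over from the earlier sketches; the commented usage line is illustrative):

def gain(examples, prop):
    """Information gain of splitting the examples on the given property."""
    total = information([cat for _, cat in examples])
    expected = 0.0
    for value in {inst[prop] for inst, _ in examples}:
        subset = [cat for inst, cat in examples if inst[prop] == value]
        expected += len(subset) / len(examples) * information(subset)
    return total - expected

def choose_property(examples, properties):
    """ID3's greedy criterion: pick the property with the largest information gain."""
    return max(properties, key=lambda p: gain(examples, p))

# Usage with the earlier sketches:
# tree = build_tree(training_examples, ["outlook", "humidity", "wind"], choose_property)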