1 Learning The holy grail of AI. If we can build systems that learn, then we can begin with minimal information and high-level strategies and have the systems better themselves. This avoids the "knowledge engineering bottleneck", where everything must be hand-coded. Effective learning is very difficult.

2 Goal Learning is "any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population" (Herbert Simon, 1983).

3 Machine Learning Symbol-based: A set of symbols represents the entities and relationships of a problem domain. Infer useful generalizations of concepts. Connectionist approach: Knowledge is represented by patterns in a network of small, simple processing units. Recognize invariant patterns in data and represent them in the structure of the network.

4 Machine Learning (cont'd) Genetic algorithms: Population of candidate solutions which mutate, combine with one another, and are selected according to a fitness measure. Stochastic methods: New results are based on both the knower's expectation and the data (Bayes' rule). Often implemented using Markov processes.

5 Types of Learning Supervised learning: Training examples, both positive and negative, are classified by a teacher for use by the learning algorithm. Unsupervised learning: Training data are not classified by a teacher; category formation (conceptual clustering) is an example. Reinforcement learning: The agent receives feedback from the environment.

6 Categorization: Symbol-based What is the data? What are the goals? How is knowledge represented? What is the concept space? What operations may be performed on concepts? How is the concept space searched (heuristics)?

7 Example – Arch recognition Problem: How to recognize the concept of 'arch' from building blocks (Winston). Symbolist Supervised learning Both positive and negative examples (near-misses) KR is by semantic networks Graph modification, node generalization Search is data-driven

8 Example (cont'd)
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y, z)

9 Example (cont'd)
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, pyramid)
supports(x, z), supports(y, z)

10 Example (cont'd)
Background knowledge: isa(brick, polygon), isa(pyramid, polygon)
Generalization:
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, polygon)
supports(x, z), supports(y, z)

11 Negative Example: Near Miss
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y, z)
touches(x, y), touches(y, x)

12 Generalization
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y, z)
~touches(x, y), ~touches(y, x)
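The following is a minimal Python sketch, in the spirit of Winston's procedure, of the two moves shown above: generalizing from a second positive example by climbing the isa hierarchy, and specializing from a near miss by adding must-not constraints. The relation encoding, the ISA table, and the function names are illustrative assumptions rather than Winston's actual program, and the near-miss handling is simplified to negating every extra relation.

ISA = {"brick": "polygon", "pyramid": "polygon"}   # background knowledge (slide 10)

def common_ancestor(t1, t2):
    # Climb the isa hierarchy until the two types meet (if they ever do).
    ancestors = {t1}
    while t1 in ISA:
        t1 = ISA[t1]
        ancestors.add(t1)
    while t2 not in ancestors and t2 in ISA:
        t2 = ISA[t2]
    return t2 if t2 in ancestors else None

def generalize(concept, positive):
    # Keep shared relations; lift differing type() relations to a common supertype.
    new = set()
    for rel in concept:
        if rel in positive:
            new.add(rel)
        elif rel[0] == "type":                          # ('type', object, kind)
            other = next((r for r in positive if r[:2] == rel[:2]), None)
            if other is not None:
                kind = common_ancestor(rel[2], other[2])
                if kind is not None:
                    new.add(("type", rel[1], kind))
    return new

def specialize(concept, near_miss):
    # Relations present only in the near miss become must-not constraints.
    return concept | {("not",) + rel for rel in near_miss - concept}

arch1 = {("part", "arch", "x"), ("part", "arch", "y"), ("part", "arch", "z"),
         ("type", "x", "brick"), ("type", "y", "brick"), ("type", "z", "brick"),
         ("supports", "x", "z"), ("supports", "y", "z")}
arch2 = {r for r in arch1 if r != ("type", "z", "brick")} | {("type", "z", "pyramid")}
near_miss = arch1 | {("touches", "x", "y"), ("touches", "y", "x")}

print(generalize(arch1, arch2))      # type of z is lifted to polygon (slide 10)
print(specialize(arch1, near_miss))  # adds not-touches constraints (slide 12)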

13 Version Space Search (Mitchell) The problem is to find a general concept (or set of concepts) that includes the positive examples and excludes the negative ones. Symbolist Supervised learning Both positive and negative examples Predicate calculus Generalization operations Search is data driven

14 Generalization Operators
Replace constant with variable: color(ball, red) -> color(X, red)
Drop conjuncts: shape(X, round) ^ size(X, small) ^ color(X, red) -> shape(X, round) ^ color(X, red)
Add disjunct: shape(X, round) ^ color(X, red) -> shape(X, round) ^ (color(X, red) v color(X, blue))
Replace property by more general property: color(X, red) -> color(X, primary_color)

15 More General Concept Concept p is more general than concept q (or p covers q) if the set of elements that satisfy p is a superset of the set of elements that satisfy q. If p(x) and q(x) are descriptions that classify objects as positive examples, then p(x) -> positive(x) |= q(x) -> positive(x).

16 Version Space The version space is the set of all concept descriptions that are consistent with the training examples. Mitchell created three algorithms for finding the version space: specific to general search, general to specific search, and the candidate elimination algorithm, which works in both directions.

17 Specific to General Search
S = {first positive training instance};
N = {};  // set of all negative instances seen so far
for each positive instance p {
    for every s ∊ S, if s does not match p, replace s in S with its most specific generalization that matches p;
    Delete from S all hypotheses more general than others in S;
    Delete from S all hypotheses that match any n ∊ N;
}
for each negative instance n {
    Delete from S all hypotheses that match n;
    N = N ∪ {n};
}

18 General to Specific Search
G = {most general concept in the concept space};
P = {};  // set of all positive instances seen so far
for each negative instance n {
    for every g ∊ G, if g matches n, replace g in G with its most general specializations that do not match n;
    Delete from G all hypotheses more specific than others in G;
    Delete from G all hypotheses that fail to match some p ∊ P;
}
for each positive instance p {
    Delete from G all hypotheses that fail to match p;
    P = P ∪ {p};
}

19 Candidate Elimination Algorithm
G = {most general concept in the concept space};
S = {first positive training instance};
for each new positive instance p {
    Delete from G all hypotheses that fail to match p;
    for every s ∊ S, if s does not match p, replace s in S with its most specific generalization that matches p;
    Delete from S all hypotheses more general than others in S;
    Delete from S all hypotheses more general than some hypothesis in G;
}

20 CEA (cont'd)
for each new negative instance n {
    Delete from S all hypotheses that match n;
    for every g ∊ G, if g matches n, replace g in G with its most general specializations that do not match n;
    Delete from G all hypotheses more specific than others in G;
    Delete from G all hypotheses more specific than some hypothesis in S;
}
If G == S and both are singletons, the algorithm has found a single concept that is consistent with the data and halts. If G and S become empty, there is no concept that satisfies the data.
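For concreteness, here is a minimal Python sketch of the candidate elimination algorithm in the common attribute-vector setting, where a hypothesis is a tuple of attribute values and '?' matches anything. The representation, helper names, and boundary-set handling are illustrative assumptions, not Mitchell's code, and the sketch assumes at least one training example is positive.

def matches(hypothesis, instance):
    # '?' matches any value; otherwise values must agree position by position.
    return all(h == '?' or h == v for h, v in zip(hypothesis, instance))

def more_general(h1, h2):
    # True if h1 covers at least every instance that h2 covers.
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def generalize(s, instance):
    # Most specific generalization of s that matches the positive instance.
    return tuple(a if a == v else '?' for a, v in zip(s, instance))

def specializations(g, instance, domains):
    # Most general specializations of g that do not match the negative instance.
    results = []
    for i, a in enumerate(g):
        if a == '?':
            for value in domains[i]:
                if value != instance[i]:
                    results.append(g[:i] + (value,) + g[i + 1:])
    return results

def candidate_elimination(examples, domains):
    # examples: list of (instance_tuple, is_positive); domains: allowed values per attribute.
    G = [tuple('?' for _ in domains)]
    S = [next(inst for inst, pos in examples if pos)]
    for instance, positive in examples:
        if positive:
            G = [g for g in G if matches(g, instance)]
            S = [s if matches(s, instance) else generalize(s, instance) for s in S]
            S = [s for s in S if any(more_general(g, s) for g in G)]
        else:
            S = [s for s in S if not matches(s, instance)]
            new_G = []
            for g in G:
                if not matches(g, instance):
                    new_G.append(g)
                    continue
                new_G.extend(h for h in specializations(g, instance, domains)
                             if any(more_general(h, s) for s in S))
            # Keep only the maximally general hypotheses in G.
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G

Running it on a handful of labeled attribute vectors returns the S and G boundaries; any hypothesis lying between them is consistent with the data, as the next slide describes.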

21 Candidate Elimination Algorithm Every concept in S should be covered by some concept in G, and the concepts that lie between them satisfy the data. Incremental in nature – it can process one training example at a time and form a usable, though incomplete, generalization. It is sensitive to noise and inconsistency in the training data. Essentially breadth-first search – heuristics can be used to trim the search space.

22 LEX: Integrating Algebraic Exprs. LEX (Mitchell, et al.) integrates algebraic expressions by starting with an initial expression and searching the space of expressions until it finds an equivalent expression with no integral signs. The system induces heuristics that improve its performance, based on data obtained from its problem solver.

23 LEX (cont'd) The operators are the rules of expression transformation:
OP1: ∫ r f(x) dx -> r ∫ f(x) dx
OP2: ∫ u dv -> uv - ∫ v du
OP3: 1 * f(x) -> f(x)
OP4: ∫ (f1(x) + f2(x)) dx -> ∫ f1(x) dx + ∫ f2(x) dx

24 Heuristics Heuristics are of the form: If the current problem state matches P, then apply operator O with bindings B. Example: If a problem state matches ∫ x transcendental(x) dx, then apply OP2 with bindings u = x, dv = transcendental(x) dx
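As an illustrative instance of this heuristic (the particular integrand is my example, not one from the slides): for ∫ x cos(x) dx the state matches ∫ x transcendental(x) dx, so OP2 applies with u = x and dv = cos(x) dx, giving v = sin(x) and du = dx, hence uv - ∫ v du = x sin(x) - ∫ sin(x) dx = x sin(x) + cos(x) + C, an expression with no remaining integral after one more step.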

25 Symbol Hierarchy There is a hierarchy of symbols and types: cos, trig, transcendental, etc.

26 LEX Architecture LEX consists of four components: A generalizer that uses the Candidate Elimination Algorithm to find heuristics, A problem solver that produces traces of problem solutions, A critic that produces positive and negative instances from the problem trace, and A problem generator that produces new candidate problems.

27 How it works LEX maintains a version space for each operator. Each version space represents the partially learned heuristic for that operator. The version space is updated from the positive and negative examples generated by the critic. The problem solver builds a tree of the space searched in solving an integration problem. It does best-first search using the partial heuristics.

28 How it works (cont'd) Deciding whether an example is positive or negative is an instance of the credit assignment problem. After solving a problem, LEX finds the shortest path from the input to the solution. Operators on that shortest path are classified as positive, and those that are not are classified as negative. Since the search is not admissible, the path found may not actually be the shortest one.

29 ID3 Decision Tree Algorithm A different approach to machine learning is to construct decision trees. At each node we test one property of the object and proceed to the appropriate child node; on reaching a leaf, we classify the object. We try to construct the best decision tree, the one with the fewest nodes (decisions). Here there may be many categories, not just positive and negative.

30 ID3 Problem: Classify a set of instances based on their values for given properties. Symbolist Supervised learning Each instance is classified into one of a finite set of types KR is the tree, and the operations are tree creation All instances must be known in advance (non-incremental)

31 Simple Tree Formation Choose a property. The property divides the set of examples into subsets according to their values for that property. Recursively create a sub-tree for each subset. Make all the sub-trees children of the root, which tests the given property.
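A minimal Python sketch of this recursive construction, under my own illustrative assumptions about the data format (each example is a dict of property values plus a class label); the strategy for picking the property is passed in as a parameter, which slides 34-35 fill in with information gain.

from collections import Counter

def build_tree(examples, properties, choose_property):
    # examples: list of (features_dict, label); properties: names still available to test.
    labels = [label for _, label in examples]
    # Stop when all examples agree or no properties remain; return the majority label.
    if len(set(labels)) == 1 or not properties:
        return Counter(labels).most_common(1)[0][0]
    prop = choose_property(examples, properties)
    remaining = [p for p in properties if p != prop]
    # Partition the examples by their value for the chosen property.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[prop], []).append((features, label))
    # The chosen property becomes the test at the root; each subset becomes a child sub-tree.
    return {"test": prop,
            "children": {value: build_tree(subset, remaining, choose_property)
                         for value, subset in partitions.items()}}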

32 Caveat The tree that is formed depends heavily on the order in which the properties are chosen. The idea is to choose the most informative property first and use it to sub-divide the space of examples. This leads to the best (smallest) tree.

33 Information Theory The amount of information in a message (Shannon) is a function of the probability of occurrence p of each possible message, namely -log2(p). Given a universe of messages M = {m_1, m_2, ..., m_n} and a probability p(m_i) for the occurrence of each message, the expected information content of a message from M is: I[M] = ∑_{i=1..n} -p(m_i) log2(p(m_i)) = E[-log2 p(m_i)]
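As a quick illustrative check (the particular numbers are mine, not from the slides): a fair coin toss carries -log2(1/2) = 1 bit, and a collection of 14 training instances with 9 in one class and 5 in the other has expected information -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits.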

34 Choosing the Property The information gain provided by choosing property A at the root of the tree is equal to the total information in the tree minus the amount of information needed to complete the classification after the choice. The amount of information needed to complete the tree is defined as the weighted average of the information in all its subtrees.

35 Choosing the Property (cont'd) Assuming a set of training instances C, if we make property P, with n values, the root of the tree, then C will be partitioned into subsets {C_1, C_2, ..., C_n}. The expected information needed to complete the tree is: E[P] = ∑_{i=1..n} (|C_i| / |C|) * I[C_i], and the information gain from choosing P is: gain(P) = I[C] - E[P].
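A minimal Python sketch of these two formulas, using the same illustrative example format as the tree-building sketch under slide 31 (again my assumptions, not ID3's original code):

import math
from collections import Counter

def information(examples):
    # I[C]: expected information of the class labels of the examples, in bits.
    total = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def expected_info(examples, prop):
    # E[P]: weighted average of the information in the subsets induced by prop.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[prop], []).append((features, label))
    total = len(examples)
    return sum(len(subset) / total * information(subset)
               for subset in partitions.values())

def gain(examples, prop):
    # gain(P) = I[C] - E[P]
    return information(examples) - expected_info(examples, prop)

def choose_property(examples, properties):
    # ID3's choice: the property with the largest information gain.
    return max(properties, key=lambda p: gain(examples, p))

Passing this choose_property into the build_tree sketch from slide 31 gives a small ID3-style learner.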

