Learning R&N: ch 19, ch 20. Types of Learning Supervised Learning - classification, prediction Unsupervised Learning – clustering, segmentation, pattern.

Learning R&N: ch 19, ch 20

Types of Learning Supervised Learning - classification, prediction Unsupervised Learning – clustering, segmentation, pattern discovery Reinforcement Learning – learning MDPs, online learning

Supervised Learning Someone gives you a bunch of examples, telling you what each one is Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)

Supervised Learning Logic-based approaches: learn a function f(X) → (true, false). Statistical/probabilistic approaches: learn a probability distribution p(Y|X).

Outline
Logic-based Approaches: Decision Trees (lecture 24), Version Spaces (today), PAC learning (we won’t cover)
Instance-based approaches
Statistical Approaches

Predicate-Learning Methods Version space Explicit representation of hypothesis space H Need to provide H with some “structure”

Version Spaces
The “version space” is the set of all hypotheses that are consistent with the training instances processed so far.
An algorithm:
V := H ;; the version space V is ALL hypotheses H
For each example e:
  Eliminate any member of V that disagrees with e
  If V is empty, FAIL
Return V as the set of consistent hypotheses
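The elimination loop above can be sketched in a few lines of Python. This is only an illustration: the encoding of a hypothesis as a (name, predicate) pair and the toy hypothesis space over integers are assumptions made for the example, not part of the lecture.

```python
def list_then_eliminate(hypotheses, examples):
    """Return every hypothesis that agrees with all labeled examples.

    hypotheses: list of (name, predicate) pairs (hypothetical encoding).
    examples:   list of (instance, label) pairs, label is True or False.
    """
    version_space = list(hypotheses)              # V := H
    for instance, label in examples:
        # Eliminate any member of V that disagrees with the example.
        version_space = [(name, h) for name, h in version_space
                         if h(instance) == label]
        if not version_space:                     # If V is empty, FAIL
            raise ValueError("no hypothesis in H is consistent with the examples")
    return version_space


# Toy hypothesis space over integers (illustrative only).
H = [
    ("even", lambda x: x % 2 == 0),
    ("positive", lambda x: x > 0),
    ("less_than_10", lambda x: x < 10),
    ("anything", lambda x: True),
]
examples = [(4, True), (12, True), (-3, False)]
consistent = [name for name, _ in list_then_eliminate(H, examples)]
print(consistent)
```

The sketch stores V explicitly, which is exactly what becomes infeasible when H is large.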

Version Spaces: The Problem
PROBLEM: V is huge!!
Suppose you have N attributes, each with k possible values.
Suppose you allow a hypothesis to be any disjunction of instances.
There are k^N possible instances, so |H| = 2^(k^N).
If N=5 and k=2, |H| = 2^32!!
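The count can be checked directly. The function name below is a hypothetical helper introduced only for this check.

```python
def count_hypotheses(num_attributes, values_per_attribute):
    """Count instances (k^N) and unrestricted disjunctive hypotheses (2^(k^N))."""
    instances = values_per_attribute ** num_attributes
    return instances, 2 ** instances


instances, hypotheses = count_hypotheses(num_attributes=5, values_per_attribute=2)
print(instances, hypotheses)
```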

Version Spaces: The Tricks
First Trick: Don’t allow arbitrary disjunctions. Organize the feature values into a hierarchy of allowed disjunctions, e.g. any-color covers pale and dark; pale covers yellow and white; dark covers blue and black. Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed).
Second Trick: Define a partial ordering on H (“general to specific”) and only keep track of the upper bound and lower bound of the version space.
RESULT: An incremental, possibly efficient algorithm!

Rewarded Card Example
(r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K) ⇔ ANY-RANK(r)
(r=1) ∨ … ∨ (r=10) ⇔ NUM(r)
(r=J) ∨ (r=Q) ∨ (r=K) ⇔ FACE(r)
(s=♠) ∨ (s=♣) ∨ (s=♦) ∨ (s=♥) ⇔ ANY-SUIT(s)
(s=♠) ∨ (s=♣) ⇔ BLACK(s)
(s=♦) ∨ (s=♥) ⇔ RED(s)
A hypothesis is any sentence of the form: R(r) ∧ S(s) ⇔ IN-CLASS([r,s]), where:
R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)
S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)

Simplified Representation
For simplicity, we represent a concept by rs, with:
r ∈ {a, n, f, 1, …, 10, j, q, k}
s ∈ {a, b, r, ♣, ♠, ♦, ♥}
For example:
n♠ represents: NUM(r) ∧ (s=♠) ⇔ IN-CLASS([r,s])
aa represents: ANY-RANK(r) ∧ ANY-SUIT(s) ⇔ IN-CLASS([r,s])

Extension of a Hypothesis
The extension of a hypothesis h is the set of objects that satisfy h.
Examples:
The extension of f♥ is: {j♥, q♥, k♥}
The extension of aa is the set of all cards

More General/Specific Relation
Let h1 and h2 be two hypotheses in H.
h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2.
Examples:
aa is more general than f♥
f♥ is more general than q♥
fr and nr are not comparable

More General/Specific Relation
Let h1 and h2 be two hypotheses in H.
h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2.
The inverse of the “more general” relation is the “more specific” relation.
The “more general” relation defines a partial ordering on the hypotheses in H.
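A minimal sketch of the extension and the “more general” test for card hypotheses follows. The tuple encoding (rank, suit) and the ASCII suit letters s, c, d, h (spades, clubs, diamonds, hearts) are assumptions made for the example; the abstract values a, n, f, b, r follow the simplified representation of the card example.

```python
NUM = [str(n) for n in range(1, 11)]
FACE = ["j", "q", "k"]

# Extension of each rank value and each suit value (hypothetical encoding).
RANK_EXT = {"a": set(NUM + FACE), "n": set(NUM), "f": set(FACE)}
RANK_EXT.update({r: {r} for r in NUM + FACE})
SUIT_EXT = {"a": set("scdh"), "b": set("sc"), "r": set("dh")}
SUIT_EXT.update({s: {s} for s in "scdh"})


def extension(h):
    """Set of concrete cards (rank, suit) that satisfy hypothesis h."""
    rank, suit = h
    return {(r, s) for r in RANK_EXT[rank] for s in SUIT_EXT[suit]}


def more_general(h1, h2):
    """h1 is more general than h2 iff ext(h1) is a proper superset of ext(h2)."""
    return extension(h1) > extension(h2)


print(len(extension(("a", "a"))))               # size of the extension of aa
print(more_general(("a", "a"), ("f", "h")))     # aa versus "face, hearts"
print(more_general(("f", "r"), ("n", "r")))     # fr versus nr
```

Comparing extensions as Python sets makes the partial ordering explicit: two hypotheses are incomparable when neither extension contains the other.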

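Example: Subset of Partial Order [Figure: lattice over the hypotheses aa, na, ab, nb, n♣, 4♣, 4b, a♣, 4a, ordered from most general (aa) to most specific (4♣)]

Construction of Ordering Relation [Figure: two value hierarchies. Rank: a covers n and f; n covers 1, …, 10; f covers j, q, k. Suit: a covers b and r; b covers ♠ and ♣; r covers ♦ and ♥]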

G-Boundary / S-Boundary of V A hypothesis in V is most general iff no hypothesis in V is more general G-boundary G of V: Set of most general hypotheses in V A hypothesis in V is most specific iff no hypothesis in V is more specific S-boundary S of V: Set of most specific hypotheses in V

aa naab nb nn 44 4b aa 4a Example: G-/S-Boundaries of V aa 44 11 k …… Now suppose that 4  is given as a positive example S G We replace every hypothesis in S whose extension does not contain 4  by its generalization set

Example: G-/S-Boundaries of V aa naab nb nn 44 4b aa 4a Here, both G and S have size 1. This is not the case in general!

Example: G-/S-Boundaries of V aa naab nb nn 44 4b aa 4a Let 7  be the next (positive) example Generalization set of 4  The generalization set of an hypothesis h is the set of the hypotheses that are immediately more general than h

Example: G-/S-Boundaries of V aa naab nb nn 44 4b aa 4a Let 7  be the next (positive) example

Example: G-/S-Boundaries of V aa naab nb nn aa Let 5 be the next (negative) example Specialization set of aa

Example: G-/S-Boundaries of V ab nb nn aa G and S, and all hypotheses in between form exactly the version space

Example: G-/S-Boundaries of V ab nb nn aa Do 8 , 6 , j  satisfy CONCEPT? Yes No Maybe At this stage …

Example: G-/S-Boundaries of V ab nb nn aa Let 2  be the next (positive) example

Example: G-/S-Boundaries of V ab nb Let j  be the next (negative) example

Example: G-/S-Boundaries of V nb + 4  7  2  – 5 j  NUM(r)  BLACK(s)  IN-CLASS([r,s])

Example: G-/S-Boundaries of V ab nb nn aa … and let 8  be the next (negative) example Let us return to the version space … The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

Example: G-/S-Boundaries of V ab nb nn aa … and let j be the next (positive) example Let us return to the version space … The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure

POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x (using the generalization sets of the hypotheses)
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)

NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not agree with x
2. Minimally specialize all hypotheses in G until they are consistent with x
3. Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
4. Remove from G every hypothesis that is more specific than another hypothesis in G
5. Return (G,S)
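The two update procedures can be sketched for the card hierarchy as follows. Several things here are assumptions made for illustration: hypotheses are (rank, suit) tuples, suits are written with the ASCII letters s, c, d, h, S is seeded with the first positive example, and the example sequence (4 and 7 of clubs positive, 5 of hearts negative, 2 of spades positive, jack of spades negative) is an assumed instantiation of the card example. Because the value hierarchies are trees, the minimal generalization of a hypothesis is found with a least common ancestor.

```python
# Parent of each value in the two tree-shaped hierarchies (hypothetical encoding).
RANK_PARENT = {**{str(n): "n" for n in range(1, 11)},
               **{f: "f" for f in "jqk"}, "n": "a", "f": "a", "a": None}
SUIT_PARENT = {"s": "b", "c": "b", "d": "r", "h": "r", "b": "a", "r": "a", "a": None}
PARENTS = (RANK_PARENT, SUIT_PARENT)


def ancestors(value, parent):
    """Return [value, parent(value), ..., root]."""
    chain = []
    while value is not None:
        chain.append(value)
        value = parent[value]
    return chain


def geq(h1, h2):
    """True if h1 is more general than or equal to h2 (h2 may be a concrete card)."""
    return all(h1[i] in ancestors(h2[i], PARENTS[i]) for i in range(2))


def lca(v1, v2, parent):
    """Least common ancestor of two values in one hierarchy."""
    chain = ancestors(v2, parent)
    return next(v for v in ancestors(v1, parent) if v in chain)


def specializations(g, x):
    """Most general hypotheses below g that exclude card x."""
    result = set()
    for i, parent in enumerate(PARENTS):
        path = ancestors(x[i], parent)
        # Proper ancestors of x[i] that lie at or below g[i].
        for node in path[1:path.index(g[i]) + 1]:
            for child, p in parent.items():
                if p == node and child not in path:
                    h = list(g)
                    h[i] = child
                    result.add(tuple(h))
    return result


def positive_update(G, S, x):
    G = {g for g in G if geq(g, x)}                                  # step 1
    S = {tuple(lca(s[i], x[i], PARENTS[i]) for i in range(2)) for s in S}  # step 2
    S = {s for s in S if any(geq(g, s) for g in G)}                  # step 3
    S = {s for s in S if not any(s != t and geq(s, t) for t in S)}   # step 4
    return G, S


def negative_update(G, S, x):
    S = {s for s in S if not geq(s, x)}                              # step 1
    specialized = set()
    for g in G:                                                      # step 2
        specialized |= specializations(g, x) if geq(g, x) else {g}
    G = {g for g in specialized if any(geq(g, s) for s in S)}        # step 3
    G = {g for g in G if not any(g != h and geq(h, g) for h in G)}   # step 4
    return G, S


# S is seeded with the first positive example (an assumption of this sketch).
G, S = {("a", "a")}, {("4", "c")}
stream = [(("7", "c"), True), (("5", "h"), False),
          (("2", "s"), True), (("j", "s"), False)]
for card, is_positive in stream:
    G, S = (positive_update if is_positive else negative_update)(G, S, card)
    if not G or not S:
        raise ValueError("version space collapsed")
print(G, S)
```

Only the two boundary sets are stored, so memory does not grow with the number of hypotheses strictly between G and S.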

Example-Selection Strategy (aka Active Learning) Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses Then a single hypothesis will be isolated in O(log |H|) steps

aa naab nb nn aa Example 9  ? j ? j  ?

Example-Selection Strategy Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses Then a single hypothesis will be isolated in O(log |H|) steps But picking the object that eliminates half the version space may be expensive
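One way to sketch the selection step is to score every candidate card by how many hypotheses in the current version space cover it and to pick the card whose count is closest to half. The encoding below ((rank, suit) tuples, ASCII suit letters s, c, d, h) and the four-hypothesis version space are assumptions made for illustration. The scan over all cards and all hypotheses is the cost the slide warns about.

```python
NUM = [str(n) for n in range(1, 11)]
FACE = ["j", "q", "k"]
RANKS = {"a": NUM + FACE, "n": NUM, "f": FACE, **{r: [r] for r in NUM + FACE}}
SUITS = {"a": "scdh", "b": "sc", "r": "dh", **{s: s for s in "scdh"}}


def covers(h, card):
    """True if the card lies in the extension of hypothesis h."""
    return card[0] in RANKS[h[0]] and card[1] in SUITS[h[1]]


def best_query(version_space, cards):
    """Pick the card whose label would split the version space most evenly."""
    half = len(version_space) / 2
    return min(cards,
               key=lambda c: abs(sum(covers(h, c) for h in version_space) - half))


# Hypothetical version space: ab, nb, "num clubs", "any clubs".
version_space = [("a", "b"), ("n", "b"), ("n", "c"), ("a", "c")]
cards = [(r, s) for r in NUM + FACE for s in "scdh"]
query = best_query(version_space, cards)
covered = sum(covers(h, query) for h in version_space)
print(query, covered)
```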

Noise If some examples are misclassified, the version space may collapse Possible solution: Maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc…

Version Spaces Useful for understanding logical (consistency-based) approaches to learning Practical if strong hypothesis bias (concept hierarchies, conjunctive hypothesis) Don’t handle noise well

VSL vs DTL Decision tree learning (DTL) is more efficient if all examples are given in advance; else, it may produce successive hypotheses, each poorly related to the previous one Version space learning (VSL) is incremental DTL can produce simplified hypotheses that do not agree with all examples DTL has been more widely used in practice

Statistical Approaches Instance-based Learning (20.4) Statistical Learning (20.1) Neural Networks (20.5)

Nearest Neighbor Methods To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class. [Figure: neighborhoods of x for k=1 and k=6]

Issues
Distance measure: the most common is Euclidean. Better distance measures normalize each variable by its standard deviation.
Choosing k: increasing k reduces variance and increases bias.
In high-dimensional spaces, the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets. Indexing the data can help (for example, KD trees).
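A brute-force k-nearest-neighbor classifier with per-feature standardization might look like the sketch below. The function names and the tiny data set are hypothetical; the full pass over the training data for every query is the cost noted above.

```python
import math
from collections import Counter


def standardizer(points):
    """Return a function that rescales a point by per-feature mean and std."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return lambda p: tuple((v - m) / s for v, m, s in zip(p, means, stds))


def knn_classify(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Hypothetical data: the second feature has a much larger scale than the first.
train_x = [(1.0, 1000.0), (1.2, 1100.0), (0.9, 900.0),
           (3.0, 5000.0), (3.2, 5200.0), (2.9, 4800.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

scale = standardizer(train_x)
scaled_x = [scale(p) for p in train_x]
label = knn_classify(scaled_x, train_y, scale((1.1, 1050.0)), k=3)
print(label)
```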

Bayesian Learning

Example: Candy Bags
Candy comes in two flavors: cherry and lime.
Candy is wrapped; you can’t tell which flavor until it is opened.
There are 5 kinds of bags of candy:
H1 = all cherry
H2 = 75% cherry, 25% lime
H3 = 50% cherry, 50% lime
H4 = 25% cherry, 75% lime
H5 = 100% lime
Given a new bag of candy, predict H.
Observations: D1, D2, D3, …

Bayesian Learning Calculate the probability of each hypothesis, given the data, and make predictions weighted by this probability (i.e., use all the hypotheses, not just the single best). Now, if we want to predict some unknown quantity X: $P(X \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d)$

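Bayesian Learning cont. Calculating P(h|d): $P(h_i \mid d) = \alpha\, P(d \mid h_i)\, P(h_i)$, where $P(d \mid h_i)$ is the likelihood, $P(h_i)$ is the prior, and $\alpha$ is a normalizing constant. Assume the observations are i.i.d. (independent and identically distributed): $P(d \mid h_i) = \prod_j P(d_j \mid h_i)$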

Example: Hypothesis Prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
Data:
Q1: After seeing d1, what is P(hi|d1)?
Q2: After seeing d1, what is P(d2|d1) for the next candy d2?
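The two questions can be worked through numerically. The observed data did not survive on the slide, so the sketch below assumes the first candy d1 is lime; the per-hypothesis lime fractions come from the candy-bag definitions and the prior is the one given above. Function names are hypothetical.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | h_i) for each bag type


def posterior(priors, p_lime, observations):
    """P(h_i | d) for i.i.d. observations, each either 'lime' or 'cherry'."""
    weights = list(priors)
    for obs in observations:
        weights = [w * (p if obs == "lime" else 1.0 - p)
                   for w, p in zip(weights, p_lime)]
    total = sum(weights)                   # normalizing constant (1 / alpha)
    return [w / total for w in weights]


def predict_lime(post, p_lime):
    """P(next candy is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * w for p, w in zip(p_lime, post))


# ASSUMPTION: d1 = lime (the actual data are missing from the slide).
post = posterior(priors, p_lime, ["lime"])
print([round(p, 3) for p in post])
print(round(predict_lime(post, p_lime), 3))
```

Under that assumption the posterior is proportional to prior times likelihood, and the prediction averages over all five hypotheses instead of committing to the single most probable one.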

Statistical Learning Approaches
Bayesian
MAP – maximum a posteriori
MLE – assume uniform prior over H

Naïve Bayes aka Idiot Bayes particularly simple BN makes overly strong independence assumptions but works surprisingly well in practice…

Bayesian Diagnosis Suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn. Suppose there are m boolean symptoms, E1, …, Em. How do we make a diagnosis? We need: $P(d_i \mid e_1, \ldots, e_m) = \frac{P(d_i)\, P(e_1, \ldots, e_m \mid d_i)}{P(e_1, \ldots, e_m)}$

Naïve Bayes Assumption Assume each piece of evidence (symptom) is independent given the diagnosis: $P(e_1, \ldots, e_m \mid d_i) = \prod_k P(e_k \mid d_i)$. Then what is the structure of the corresponding BN?

Naïve Bayes Example
Possible diagnoses: Allergy, Cold and OK (Well)
Possible symptoms: Sneeze, Cough and Fever

              Well   Cold   Allergy
P(d)          0.9    0.05   0.05
P(sneeze|d)   0.1    0.9    0.9
P(cough|d)    0.1    0.8    0.7
P(fever|d)    0.01   0.7    0.4

My symptoms are: sneeze & cough. What is the diagnosis?
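The diagnosis can be computed with a few multiplications. Two table entries are unreadable in the extracted slide; the sketch assumes P(Allergy) = 0.05 (so the priors sum to 1) and P(sneeze | Allergy) = 0.9. It also treats fever as unobserved, so it is simply left out of the product. The dictionary layout and function name are hypothetical.

```python
prior = {"Well": 0.9, "Cold": 0.05, "Allergy": 0.05}   # Allergy value is assumed
likelihood = {
    "sneeze": {"Well": 0.1, "Cold": 0.9, "Allergy": 0.9},   # Allergy value is assumed
    "cough": {"Well": 0.1, "Cold": 0.8, "Allergy": 0.7},
    "fever": {"Well": 0.01, "Cold": 0.7, "Allergy": 0.4},
}


def diagnose(observed):
    """Posterior over diagnoses given {symptom: present?}; missing symptoms are ignored."""
    scores = {}
    for d, p in prior.items():
        for symptom, present in observed.items():
            q = likelihood[symptom][d]
            p *= q if present else 1.0 - q
        scores[d] = p
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


posterior = diagnose({"sneeze": True, "cough": True})
best = max(posterior, key=posterior.get)
print(best, {d: round(p, 3) for d, p in posterior.items()})
```

If fever is instead treated as observed and absent, pass `"fever": False` as well; with the numbers assumed here the most probable diagnosis then changes from Cold to Allergy.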

Learning the Probabilities
aka parameter estimation
We need:
P(d_i) – prior
P(e_k|d_i) – conditional probability
Use training data to estimate.

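Maximum Likelihood Estimate (MLE)
Use frequencies in the training set to estimate:
$P(d_i) = \frac{n_{d_i}}{N}$, $P(e_k \mid d_i) = \frac{n_{d_i \wedge e_k}}{n_{d_i}}$
where n_x is shorthand for the count of event x in the training set and N is the total number of training examples.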

Example:
Columns: D, Sneeze, Cough, Fever
Allergy: yes, no
Well: yes, no
Allergy: yes, no, yes
Allergy: yes, no
Cold: yes
Allergy: yes, no
Well: no
Well: no
Allergy: no
Allergy: yes, no
What is: P(Allergy)? P(Sneeze|Allergy)? P(Cough|Allergy)?

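Laplace Estimate (smoothing)
Use smoothing to eliminate zeros:
$P(d_i) = \frac{n_{d_i} + 1}{N + n}$, $P(e_k \mid d_i) = \frac{n_{d_i \wedge e_k} + 1}{n_{d_i} + 2}$
where N is the total number of training examples, n is the number of possible values for d, and e is assumed to have 2 possible values.
Many other smoothing schemes…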
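Both estimators can be sketched with simple counting. The training rows below are hypothetical (they are not the table from the lecture), and the sketch takes the number of possible diagnoses n to be the number of distinct diagnoses seen in the data, which is an assumption.

```python
from collections import Counter


def estimate(rows, symptom, smooth=False):
    """Estimate P(d) and P(symptom | d) from (diagnosis, {symptom: bool}) rows.

    smooth=False gives the maximum likelihood estimate,
    smooth=True gives the add-one (Laplace) estimate.
    """
    n_d = Counter(d for d, _ in rows)                    # count of each diagnosis
    n_de = Counter(d for d, e in rows if e[symptom])     # diagnosis with symptom present
    k = 1 if smooth else 0
    n_values = len(n_d)        # assumed: all possible diagnoses appear in the data
    prior = {d: (n_d[d] + k) / (len(rows) + k * n_values) for d in n_d}
    cond = {d: (n_de[d] + k) / (n_d[d] + 2 * k) for d in n_d}
    return prior, cond


# Hypothetical training set.
rows = [
    ("Allergy", {"sneeze": True}),
    ("Allergy", {"sneeze": True}),
    ("Well", {"sneeze": False}),
    ("Cold", {"sneeze": True}),
]
mle_prior, mle_cond = estimate(rows, "sneeze")
lap_prior, lap_cond = estimate(rows, "sneeze", smooth=True)
print(mle_cond["Well"], lap_cond["Well"])
```

The maximum likelihood estimate of P(sneeze | Well) is zero here because no Well example sneezes; the smoothed estimate is small but nonzero, so a single unseen combination cannot veto a diagnosis.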

Comments Generally works well despite blanket assumption of independence Experiments show competitive with decision trees on some well known test sets (UCI) handles noisy data

Learning more complex Bayesian networks Two subproblems: learning structure: combinatorial search over space of networks learning parameters values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’

Neural Networks

Neural function
Brain function (thought) occurs as the result of the firing of neurons.
Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters.
Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds.
Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength.
There are about 10^11 neurons and about 10^14 synapses in the human brain.

Biology of a neuron

Brain structure Different areas of the brain have different functions Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly We don’t know how different functions are “assigned” or acquired Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors) Partly the result of experience (learning) We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought” Our neural networks are not nearly as complex or intricate as the actual brain structure

Comparison of computing power Computers are way faster than neurons… But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel Neural networks are designed to be massively parallel The brain is effectively a billion times faster

Neural networks Neural networks are made up of nodes or units, connected by links Each link has an associated weight and activation level Each node has an input function (typically summing over weighted inputs), an activation function, and an output

Neural unit

Model Neuron
Neuron modeled as a unit j.
Weights on the input from unit i to unit j: w_ji
Net input to unit j is: $net_j = \sum_i w_{ji}\, o_i$
Threshold T_j: o_j is 1 if net_j > T_j, and 0 otherwise.

Neural Computation McCulloch and Pitts (1943) showed how linear threshold units (LTUs) can be used to compute logical functions: AND? OR? NOT? Two layers of LTUs can represent any boolean function.
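One possible answer to the AND/OR/NOT question is sketched below. The weights and thresholds are hand-picked for illustration (many other choices work), and the two-layer XOR at the end shows a function that needs a hidden layer of LTUs.

```python
def ltu(weights, threshold, inputs):
    """Linear threshold unit: output 1 if the weighted input sum exceeds the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0


def AND(a, b):
    return ltu([1, 1], 1.5, [a, b])


def OR(a, b):
    return ltu([1, 1], 0.5, [a, b])


def NOT(a):
    return ltu([-1], -0.5, [a])


def XOR(a, b):
    """Two layers: hidden units compute OR and AND, the output unit computes OR-and-not-AND."""
    return ltu([1, -1], 0.5, [OR(a, b), AND(a, b)])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```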

Learning Rules Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule. The rule assumes binary-valued inputs/outputs and a single linear threshold unit.

Perceptron Learning rule
If the target output for unit j is t_j: $w_{ji} \leftarrow w_{ji} + \eta\,(t_j - o_j)\,o_i$, where $\eta$ is the learning rate.
Equivalent to the intuitive rules:
If the output is correct, don’t change the weights.
If the output is low (o_j = 0, t_j = 1), increment the weights for all the inputs which are 1.
If the output is high (o_j = 1, t_j = 0), decrement the weights for all inputs which are 1.
Must also adjust the threshold. Or, equivalently, assume there is a weight w_j0 for an extra input unit that has o_0 = 1.

Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting weights according to the perceptron learning rule, until all outputs are correct:
Initialize the weights to all zero (or random)
Until outputs for all training examples are correct
  for each training example e do
    compute the current output o_j
    compare it to the target t_j and update weights
Each execution of the outer loop is an epoch.
For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold.
Q: when will the algorithm terminate?
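A minimal sketch of the training loop follows. It folds the threshold into a bias weight on a constant input of 1, and it adds a `max_epochs` cap (an assumption, not part of the algorithm as stated) so that the loop also stops on data that no single unit can fit. Function names and the learning rate default are hypothetical.

```python
def predict(weights, x):
    """Output of a single threshold unit; weights[0] multiplies a constant input of 1."""
    net = sum(w * xi for w, xi in zip(weights, [1] + list(x)))
    return 1 if net > 0 else 0


def train_perceptron(examples, rate=1.0, max_epochs=100):
    """Return (weights, epochs_used); epochs_used is None if the cap was reached."""
    weights = [0.0] * (len(examples[0][0]) + 1)
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, target in examples:
            output = predict(weights, x)
            if output != target:
                mistakes += 1
                # Perceptron rule: w <- w + rate * (t - o) * input
                weights = [w + rate * (target - output) * xi
                           for w, xi in zip(weights, [1] + list(x))]
        if mistakes == 0:          # every training example is classified correctly
            return weights, epoch
    return weights, None


or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_weights, or_epochs = train_perceptron(or_data)
xor_weights, xor_epochs = train_perceptron(xor_data)
print(or_weights, or_epochs)
print(xor_epochs)
```

On the linearly separable OR data the loop finishes with an error-free epoch; on XOR it never does, so only the cap stops it.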

Representation Limitations of a Perceptron Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space.

Perceptron Learnability Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969). Unfortunately, many functions (like parity) cannot be represented by an LTU.

Layered feed-forward network [Figure: layers from top to bottom: output units, hidden units, input units]

“Executing” neural networks
Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level.
Working forward through the network, the input function of each unit is applied to compute the input value. Usually this is just the weighted sum of the activation on the links feeding into this node.
The activation function transforms this input value into a final value. Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node.
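The forward pass described above can be sketched as follows. The network representation (a list of layers, each a list of (weights, bias) pairs) and the hand-picked weights are assumptions for illustration; the weights are chosen so that the hidden units approximate OR and NAND and the output approximates their AND, which gives XOR.

```python
import math


def sigmoid(z):
    """Smooth, differentiable stand-in for a hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, inputs):
    """Propagate activations layer by layer.

    layers: list of layers; each layer is a list of (weights, bias) pairs, one per unit.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations


# Hypothetical hand-picked weights.
network = [
    [([20.0, 20.0], -10.0), ([-20.0, -20.0], 30.0)],   # hidden units: ~OR, ~NAND
    [([20.0, 20.0], -30.0)],                           # output unit: ~AND of the hidden units
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(forward(network, [a, b])[0], 3))
```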
