Learning R&N: ch 19, ch 20
Types of Learning
- Supervised learning: classification, prediction
- Unsupervised learning: clustering, segmentation, pattern discovery
- Reinforcement learning: learning MDPs, online learning
Supervised Learning
- Someone gives you a bunch of examples, telling you what each one is
- Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)
Supervised Learning
- Logic-based approaches: learn a function f(X) → {true, false}
- Statistical/probabilistic approaches: learn a probability distribution p(Y|X)
Outline
- Logic-based approaches
  - Decision trees (lecture 24)
  - Version spaces (today)
  - PAC learning (we won't cover)
- Instance-based approaches
- Statistical approaches
Predicate-Learning Methods
- Version space: explicit representation of the hypothesis space H
- Need to provide H with some "structure"
Version Spaces
- The "version space" is the set of all hypotheses that are consistent with the training instances processed so far.
- An algorithm:
    V := H    ;; the version space V starts as ALL hypotheses H
    For each example e:
        Eliminate any member of V that disagrees with e
        If V is empty, FAIL
    Return V as the set of consistent hypotheses
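The elimination loop above can be sketched in a few lines of Python. This is only an illustration: the toy hypothesis space (threshold tests "x >= t" on integers) and the names `version_space`, `H`, and `V` are assumptions made for the example, not part of the lecture.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, Callable[[int], bool]]   # (name, predicate)

def version_space(hypotheses: List[Hypothesis],
                  examples: List[Tuple[int, bool]]) -> List[Hypothesis]:
    """Return every hypothesis that agrees with all labeled examples."""
    v = list(hypotheses)                      # V := H
    for x, label in examples:
        # Eliminate any member of V that disagrees with this example.
        v = [(name, h) for name, h in v if h(x) == label]
        if not v:
            raise ValueError("FAIL: no hypothesis is consistent with the examples")
    return v

# Hypothetical toy space: "x >= t" for thresholds t = 0..5.
H = [(f"x >= {t}", lambda x, t=t: x >= t) for t in range(6)]
V = version_space(H, [(4, True), (1, False)])
names = [name for name, _ in V]
print(names)  # ['x >= 2', 'x >= 3', 'x >= 4']
```

Storing V explicitly like this only works when H is small enough to enumerate.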
Version Spaces: The Problem
- PROBLEM: V is huge!
- Suppose you have N attributes, each with k possible values
- Suppose you allow a hypothesis to be any disjunction of instances
- There are k^N possible instances, so |H| = 2^(k^N)
- If N=5 and k=2, |H| = 2^32 (more than four billion hypotheses)
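A quick check of the arithmetic, with variable names chosen just for this sketch:

```python
N, k = 5, 2                          # 5 attributes, 2 possible values each
num_instances = k ** N               # every combination of attribute values
num_hypotheses = 2 ** num_instances  # every subset (disjunction) of instances
print(num_instances, num_hypotheses)  # 32 4294967296
```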
Version Spaces: The Tricks
- First trick: don't allow arbitrary disjunctions
  - Organize the feature values into a hierarchy of allowed disjunctions, e.g.
      any-color
        pale: yellow, white
        dark: blue, black
  - Now there are only 7 "abstract values" instead of 16 disjunctive combinations (e.g., "black or white" isn't allowed)
- Second trick: define a partial ordering on H ("general to specific") and only keep track of the upper bound and lower bound of the version space
- RESULT: an incremental, possibly efficient algorithm!
Rewarded Card Example
- (r=1) v … v (r=10) v (r=J) v (r=Q) v (r=K) ⇔ ANY-RANK(r)
- (r=1) v … v (r=10) ⇔ NUM(r)
- (r=J) v (r=Q) v (r=K) ⇔ FACE(r)
- (s=♠) v (s=♣) v (s=♦) v (s=♥) ⇔ ANY-SUIT(s)
- (s=♠) v (s=♣) ⇔ BLACK(s)
- (s=♦) v (s=♥) ⇔ RED(s)
- A hypothesis is any sentence of the form R(r) ∧ S(s) ⇒ IN-CLASS([r,s]), where:
  - R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)
  - S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)
Simplified Representation
- For simplicity, we represent a concept by rs, with:
  - r ∈ {a, n, f, 1, …, 10, j, q, k}
  - s ∈ {a, b, r, ♠, ♣, ♦, ♥}
- For example:
  - n♠ represents NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s])
  - aa represents ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])
Extension of a Hypothesis
- The extension of a hypothesis h is the set of objects that satisfy h
- Examples:
  - The extension of f♥ is {j♥, q♥, k♥}
  - The extension of aa is the set of all cards
More General/Specific Relation
- Let h1 and h2 be two hypotheses in H
- h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2
- Examples:
  - aa is more general than f♥
  - f♥ is more general than q♥
  - fr and nr are not comparable
More General/Specific Relation (cont.)
- The inverse of the "more general" relation is the "more specific" relation
- The "more general" relation defines a partial ordering on the hypotheses in H
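The definition can be checked directly by computing extensions over a 52-card deck. The sketch below is an assumption-laden illustration: hypotheses are written as (rank code, suit code) pairs, the letters "S", "C", "D", "H" stand in for the suit symbols, and the helper names are invented for the example.

```python
from itertools import product

RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
SUITS = ["S", "C", "D", "H"]      # stand-ins for spades, clubs, diamonds, hearts
CARDS = set(product(RANKS, SUITS))

def rank_ok(r, rank):
    """Does rank code r (a, n, f, or a specific rank) admit this rank?"""
    if r == "a":
        return True
    if r == "n":
        return rank.isdigit()
    if r == "f":
        return rank in ("j", "q", "k")
    return r == rank

def suit_ok(s, suit):
    """Does suit code s (a, b, r, or a specific suit) admit this suit?"""
    if s == "a":
        return True
    if s == "b":
        return suit in ("S", "C")
    if s == "r":
        return suit in ("D", "H")
    return s == suit

def extension(h):
    r, s = h
    return {(rank, suit) for rank, suit in CARDS
            if rank_ok(r, rank) and suit_ok(s, suit)}

def more_general(h1, h2):
    # Proper superset of extensions, as in the definition.
    return extension(h1) > extension(h2)

print(more_general(("a", "a"), ("f", "H")))   # True
print(more_general(("f", "H"), ("q", "H")))   # True
print(more_general(("f", "r"), ("n", "r")),
      more_general(("n", "r"), ("f", "r")))   # False False
```

Because neither of fr and nr contains the other, the relation is only a partial order.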
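Example: Subset of Partial Order
[Figure: lattice over the hypotheses aa, na, ab, 4a, nb, a♣, 4b, n♣, 4♣, ordered from most general (aa) to most specific (4♣)]
Construction of Ordering Relation
[Figure: rank hierarchy (1, …, 10 under n; j, …, k under f; n and f under a) and suit hierarchy (♠, ♣ under b; ♦, ♥ under r; b and r under a)]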
G-Boundary / S-Boundary of V
- A hypothesis in V is most general iff no hypothesis in V is more general
- G-boundary G of V: set of most general hypotheses in V
- A hypothesis in V is most specific iff no hypothesis in V is more specific
- S-boundary S of V: set of most specific hypotheses in V
Example: G-/S-Boundaries of V
[Figure: initially G = {aa} and S is the set of all single-card hypotheses]
- Now suppose that 4♣ is given as a positive example
- We replace every hypothesis in S whose extension does not contain 4♣ by its generalization set
Example: G-/S-Boundaries of V
[Figure: lattice aa, na, ab, 4a, nb, a♣, 4b, n♣, 4♣ with G = {aa} and S = {4♣}]
- Here, both G and S have size 1. This is not the case in general!
Example: G-/S-Boundaries of V
[Figure: same lattice; the generalization set of 4♣ is {4b, n♣}]
- Let 7♣ be the next (positive) example
- The generalization set of a hypothesis h is the set of the hypotheses that are immediately more general than h
Example: G-/S-Boundaries of V
[Figure: after the positive example 7♣, G = {aa} and S = {n♣}]
Example: G-/S-Boundaries of V
[Figure: version space aa, na, ab, nb, a♣, n♣, with the specialization set of aa marked]
- Let 5♥ be the next (negative) example
Example: G-/S-Boundaries of V
[Figure: version space ab, nb, a♣, n♣ with G = {ab} and S = {n♣}]
- G and S, and all hypotheses in between, form exactly the version space
Example: G-/S-Boundaries of V
[Figure: version space ab, nb, a♣, n♣]
- At this stage, do 8♣, 6♦, j♠ satisfy CONCEPT?
- 8♣: yes; 6♦: no; j♠: maybe
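Example: G-/S-Boundaries of V
[Figure: version space ab, nb, a♣, n♣]
- Let 2♠ be the next (positive) example
Example: G-/S-Boundaries of V
[Figure: version space ab, nb]
- Let j♠ be the next (negative) example
Example: G-/S-Boundaries of V
- Positive examples: 4♣, 7♣, 2♠
- Negative examples: 5♥, j♠
- Result: nb, i.e. NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])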
Example: G-/S-Boundaries of V
[Figure: version space ab, nb, a♣, n♣]
- Let us return to the version space … and let 8♣ be the next (negative) example
- The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples
Example: G-/S-Boundaries of V
[Figure: version space ab, nb, a♣, n♣]
- Let us return to the version space … and let j♥ be the next (positive) example
- The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples
Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure
POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x (using the generalization sets of the hypotheses)
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)
NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not agree with x
2. Minimally specialize all hypotheses in G until they are consistent with x
3. Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
4. Remove from G every hypothesis that is more specific than another hypothesis in G
5. Return (G,S)
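One possible reading of the two update procedures, applied to the card domain, is sketched below. Several things here are assumptions made for the illustration: hypotheses and cards are (rank, suit) pairs, "S", "C", "D", "H" stand in for the suit symbols, the particular suits in the example sequence are chosen to be consistent with the walkthrough, and the helper `climb` is just one way to implement "minimally generalize/specialize" (search upward or downward, then prune in steps 3 and 4).

```python
RANK_PARENT = {**{str(n): "n" for n in range(1, 11)},
               **{c: "f" for c in "jqk"}, "n": "a", "f": "a"}
SUIT_PARENT = {"S": "b", "C": "b", "D": "r", "H": "r", "b": "a", "r": "a"}

def ancestors(v, parent):
    """v plus every value above it in its hierarchy."""
    out = [v]
    while v in parent:
        v = parent[v]
        out.append(v)
    return out

def geq(h1, h2):
    """h1 is more general than or equal to h2 (h2 may also be a single card)."""
    return (h1[0] in ancestors(h2[0], RANK_PARENT)
            and h1[1] in ancestors(h2[1], SUIT_PARENT))

def generalizations(h):
    """Hypotheses immediately more general than h."""
    r, s = h
    up = []
    if r in RANK_PARENT:
        up.append((RANK_PARENT[r], s))
    if s in SUIT_PARENT:
        up.append((r, SUIT_PARENT[s]))
    return up

def specializations(h):
    """Hypotheses immediately more specific than h."""
    r, s = h
    return ([(c, s) for c, p in RANK_PARENT.items() if p == r]
            + [(r, c) for c, p in SUIT_PARENT.items() if p == s])

def climb(start, accept, step):
    """Follow `step` from each hypothesis until `accept` holds."""
    frontier, found = set(start), set()
    while frontier:
        h = frontier.pop()
        if accept(h):
            found.add(h)
        else:
            frontier.update(step(h))
    return found

def positive_update(G, S, x):
    G = {g for g in G if geq(g, x)}                                  # step 1
    S = climb(S, lambda h: geq(h, x), generalizations)               # step 2
    S = {s for s in S if any(geq(g, s) for g in G)}                  # step 3
    S = {s for s in S if not any(s != t and geq(s, t) for t in S)}   # step 4
    return G, S

def negative_update(G, S, x):
    S = {s for s in S if not geq(s, x)}                              # step 1
    G = climb(G, lambda h: not geq(h, x), specializations)           # step 2
    G = {g for g in G if any(geq(g, s) for s in S)}                  # step 3
    G = {g for g in G if not any(g != t and geq(t, g) for t in G)}   # step 4
    return G, S

def update(G, S, x, positive):
    G, S = (positive_update if positive else negative_update)(G, S, x)
    if not G or not S:
        raise ValueError("failure: no hypothesis in H agrees with all examples")
    return G, S

G = {("a", "a")}                                   # most general hypothesis
S = {(r, s) for r in RANK_PARENT if r not in ("n", "f") for s in "SCDH"}
examples = [(("4", "C"), True), (("7", "C"), True), (("5", "H"), False),
            (("2", "S"), True), (("j", "S"), False)]
for card, positive in examples:
    G, S = update(G, S, card, positive)
print(G, S)  # {('n', 'b')} {('n', 'b')}
```

Both boundaries converge to nb, matching the outcome of the card example.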
Example-Selection Strategy (aka Active Learning)
- Suppose that at each step the learning procedure can select the object (card) of the next example
- Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses
- Then a single hypothesis will be isolated in O(log |H|) steps
Example
[Figure: version space aa, na, ab, nb, a♣, n♣, with three candidate cards (a 9 and two jacks) to query next]
Example-Selection Strategy (cont.)
- But picking the object that eliminates half the version space may be expensive
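A brute-force version of this selection rule is sketched below for an explicitly enumerated version space. It assumes hypotheses and cards are (rank, suit) pairs with "S", "C", "D", "H" standing in for the suit symbols; `best_query` and the six-hypothesis version space `V` are illustrative names and data. The scan over every card and every hypothesis is exactly the cost the last bullet warns about.

```python
from itertools import product

RANK_PARENT = {**{str(n): "n" for n in range(1, 11)},
               **{c: "f" for c in "jqk"}, "n": "a", "f": "a"}
SUIT_PARENT = {"S": "b", "C": "b", "D": "r", "H": "r", "b": "a", "r": "a"}

def ancestors(v, parent):
    out = [v]
    while v in parent:
        v = parent[v]
        out.append(v)
    return out

def covers(h, card):
    """True if the card belongs to the extension of hypothesis h."""
    return (h[0] in ancestors(card[0], RANK_PARENT)
            and h[1] in ancestors(card[1], SUIT_PARENT))

def best_query(version_space, cards):
    """Pick the card whose label would split the version space most evenly."""
    def imbalance(card):
        yes = sum(covers(h, card) for h in version_space)
        return abs(2 * yes - len(version_space))
    return min(cards, key=imbalance)

RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
CARDS = list(product(RANKS, "SCDH"))
# Version space with G = {aa} and S = {n-clubs}.
V = [("a", "a"), ("n", "a"), ("a", "b"), ("n", "b"), ("a", "C"), ("n", "C")]
query = best_query(V, CARDS)
print(query)  # ('j', 'C')
```

Whatever the label of the jack of clubs turns out to be, three of the six hypotheses are eliminated.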
Noise
- If some examples are misclassified, the version space may collapse
- Possible solution: maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc.
Version Spaces
- Useful for understanding logical (consistency-based) approaches to learning
- Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses)
- Don't handle noise well
VSL vs DTL
- Decision tree learning (DTL) is more efficient if all examples are given in advance; otherwise, it may produce successive hypotheses, each poorly related to the previous one
- Version space learning (VSL) is incremental
- DTL can produce simplified hypotheses that do not agree with all examples
- DTL has been more widely used in practice
Statistical Approaches
- Instance-based learning (20.4)
- Statistical learning (20.1)
- Neural networks (20.5)
Nearest Neighbor Methods
- To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class among them
[Figure: neighborhoods of x for k=1 and k=6]
Issues
- Distance measure
  - Most common: Euclidean
  - Better distance measures: normalize each variable by its standard deviation
- Choosing k
  - Increasing k reduces variance, increases bias
- In high-dimensional spaces, the nearest neighbor may not be very close at all!
- Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets. Indexing the data can help, for example with k-d trees
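A minimal k-nearest-neighbor classifier that normalizes each variable by its standard deviation might look like the following sketch. The training points and labels are made up for the example, and the linear scan over the training set is the "pass through the data" mentioned above (no indexing is attempted).

```python
import math
from collections import Counter
from statistics import pstdev

def knn_classify(train, labels, x, k=3):
    """Majority vote among the k training points closest to x."""
    # Scale each variable by its standard deviation (1.0 if it is constant).
    scales = [pstdev(col) or 1.0 for col in zip(*train)]

    def dist(p, q):
        return math.sqrt(sum(((a - b) / s) ** 2
                             for a, b, s in zip(p, q, scales)))

    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical data: two features on very different scales.
train = [(1.0, 100.0), (1.2, 110.0), (0.9, 95.0),
         (3.0, 300.0), (3.2, 310.0), (2.9, 290.0)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train, labels, (1.1, 105.0)))  # a
print(knn_classify(train, labels, (3.1, 305.0)))  # b
```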
Bayesian Learning
Example: Candy Bags
- Candy comes in two flavors: cherry and lime
- Candy is wrapped; you can't tell which flavor until it is opened
- There are 5 kinds of bags of candy:
  - H_1 = all cherry
  - H_2 = 75% cherry, 25% lime
  - H_3 = 50% cherry, 50% lime
  - H_4 = 25% cherry, 75% lime
  - H_5 = 100% lime
- Given a new bag of candy, predict H
- Observations: D_1, D_2, D_3, …
Bayesian Learning
- Calculate the probability of each hypothesis, given the data, and make predictions weighted by this probability (i.e., use all the hypotheses, not just the single best)
- Now, if we want to predict some unknown quantity X:
    P(X | d) = Σ_i P(X | h_i) P(h_i | d)
Bayesian Learning cont.
- Calculating P(h|d):
    P(h_i | d) = α P(d | h_i) P(h_i)    (likelihood × prior)
- Assume the observations are i.i.d. (independent and identically distributed):
    P(d | h_i) = Π_j P(d_j | h_i)
Example:
- Hypothesis prior over h_1, …, h_5 is {0.1, 0.2, 0.4, 0.2, 0.1}
- Data: the observed candies d_1, d_2, …
- Q1: After seeing d_1, what is P(h_i | d_1)?
- Q2: After seeing d_1, what is the probability of each flavor for d_2, P(d_2 | d_1)?
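The two questions can be worked through numerically. The sketch below assumes, purely for illustration, that the first observation d_1 is a lime candy; the prior and the bag compositions come from the slides, while the function names are invented here.

```python
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h_1), ..., P(h_5)
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i) for each bag type

def posterior(prior, observations):
    """P(h_i | d) for i.i.d. observations, each 'lime' or 'cherry'."""
    post = list(prior)
    for flavor in observations:
        like = [p if flavor == "lime" else 1 - p for p in P_LIME]
        post = [a * b for a, b in zip(post, like)]   # likelihood * prior
    z = sum(post)                                    # 1 / alpha
    return [p / z for p in post]

def predict_lime(post):
    """P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * q for p, q in zip(post, P_LIME))

post = posterior(PRIOR, ["lime"])        # assumption: d_1 is lime
print([round(p, 3) for p in post])       # [0.0, 0.1, 0.4, 0.3, 0.2]
print(round(predict_lime(post), 3))      # 0.65
```

Under that assumption, one lime candy already rules out the all-cherry bag and shifts weight toward the lime-heavy bags.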
Statistical Learning Approaches
- Bayesian
- MAP: maximum a posteriori
- MLE: maximum likelihood estimate; assumes a uniform prior over H
Naïve Bayes
- aka Idiot Bayes
- A particularly simple BN
- Makes overly strong independence assumptions
- But works surprisingly well in practice…
Bayesian Diagnosis
- Suppose we want to make a diagnosis D, and there are n possible mutually exclusive diagnoses d_1, …, d_n
- Suppose there are m boolean symptoms, E_1, …, E_m
- How do we make a diagnosis? We need:
    P(d_i | e_1, …, e_m) = P(d_i) P(e_1, …, e_m | d_i) / P(e_1, …, e_m)
Naïve Bayes Assumption
- Assume each piece of evidence (symptom) is independent given the diagnosis:
    P(e_1, …, e_m | d_i) = Π_k P(e_k | d_i)
- Then what is the structure of the corresponding BN?
Naïve Bayes Example
- Possible diagnoses: Allergy, Cold, and Well (OK)
- Possible symptoms: Sneeze, Cough, and Fever
               Well    Cold    Allergy
  P(d)
  P(sneeze|d)
  P(cough|d)
  P(fever|d)
- My symptoms are sneeze and cough; what is the diagnosis?
Learning the Probabilities
- aka parameter estimation
- We need:
  - P(d_i): prior
  - P(e_k | d_i): conditional probability
- Use training data to estimate them
Maximum Likelihood Estimate (MLE)
- Use frequencies in the training set to estimate:
    P(d_i) = n_{d_i} / N
    P(e_k | d_i) = n_{d_i ∧ e_k} / n_{d_i}
  where n_x is shorthand for the count of event x in the training set and N is the total number of training examples
Example:
  D        Sneeze  Cough  Fever
  Allergy  yes no
  Well     yes no
  Allergy  yes no yes
  Allergy  yes no
  Cold     yes
  Allergy  yes no
  Well     no
  Well     no
  Allergy  no
  Allergy  yes no
- What is P(Allergy)? P(Sneeze | Allergy)? P(Cough | Allergy)?
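Laplace Estimate (smoothing)
- Use smoothing to eliminate zeros:
    P(d_i) = (n_{d_i} + 1) / (N + n)
    P(e_k | d_i) = (n_{d_i ∧ e_k} + 1) / (n_{d_i} + 2)
  where N is the total number of training examples, n is the number of possible values for d, and e is assumed to have 2 possible values
- Many other smoothing schemes…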
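The difference between the frequency-based estimate and the smoothed one shows up as soon as some symptom never co-occurs with a diagnosis. The sketch below uses a small made-up training set (not the table from the slides); `train`, `classify`, and the row format are assumptions for the illustration, and n is taken to be the number of distinct diagnoses seen in the data.

```python
from collections import Counter

def train(rows, laplace=False):
    """rows: list of (diagnosis, {symptom: bool}). Returns (prior, cond)."""
    n_d = Counter(d for d, _ in rows)
    n_de = Counter((d, e) for d, ev in rows for e, on in ev.items() if on)
    N, n = len(rows), len(n_d)
    k = 1 if laplace else 0          # add-one smoothing when requested
    prior = {d: (n_d[d] + k) / (N + k * n) for d in n_d}
    cond = {(d, e): (n_de[(d, e)] + k) / (n_d[d] + 2 * k)
            for d in n_d for e in rows[0][1]}
    return prior, cond

def classify(prior, cond, evidence):
    """Normalized P(d | evidence) under the naive independence assumption."""
    scores = {}
    for d, p in prior.items():
        for e, on in evidence.items():
            p *= cond[(d, e)] if on else 1 - cond[(d, e)]
        scores[d] = p
    z = sum(scores.values())
    return {d: s / z for d, s in scores.items()}

def row(d, sneeze, cough, fever):
    return d, {"sneeze": sneeze, "cough": cough, "fever": fever}

# Hypothetical training set.
rows = [row("Allergy", True, False, False), row("Allergy", True, True, False),
        row("Cold", True, True, True), row("Cold", False, True, True),
        row("Well", False, False, False), row("Well", False, False, False),
        row("Well", True, False, False), row("Well", False, False, False)]

mle_prior, mle_cond = train(rows)
lap_prior, lap_cond = train(rows, laplace=True)
print(mle_cond[("Allergy", "fever")], lap_cond[("Allergy", "fever")])  # 0.0 0.25

symptoms = {"sneeze": True, "cough": True, "fever": False}
post = classify(lap_prior, lap_cond, symptoms)
print(max(post, key=post.get))  # Allergy
```

With the unsmoothed estimates, a single zero factor wipes out a diagnosis entirely; smoothing keeps every diagnosis possible.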
Comments
- Generally works well despite the blanket assumption of independence
- Experiments show it is competitive with decision trees on some well-known test sets (UCI)
- Handles noisy data
Learning more complex Bayesian networks
- Two subproblems:
  - Learning structure: combinatorial search over the space of networks
  - Learning parameter values: easy if all of the variables are observed in the training set; harder if there are "hidden variables"
Neural Networks
Neural function
- Brain function (thought) occurs as the result of the firing of neurons
- Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters
- Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds
- Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength
- There are about 10^11 neurons and about 10^14 synapses in the human brain
Biology of a neuron
Brain structure
- Different areas of the brain have different functions
  - Some areas seem to have the same function in all humans (e.g., Broca's region); the overall layout is generally consistent
  - Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly
- We don't know how different functions are "assigned" or acquired
  - Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  - Partly the result of experience (learning)
- We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought"
- Our neural networks are not nearly as complex or intricate as the actual brain structure
Comparison of computing power
- Computers are way faster than neurons…
- But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel
- Neural networks are designed to be massively parallel
- The brain is effectively a billion times faster
Neural networks
- Neural networks are made up of nodes or units, connected by links
- Each link has an associated weight and activation level
- Each node has an input function (typically summing over weighted inputs), an activation function, and an output
Neural unit
Model Neuron
- Neuron modeled as a unit j
- Weights on the input from unit i to unit j: w_ji
- Net input to unit j is: net_j = Σ_i w_ji o_i
- Threshold T_j
- o_j is 1 if net_j > T_j, and 0 otherwise
Neural Computation
- McCulloch and Pitts (1943) showed how linear threshold units (LTUs) can be used to compute logical functions
  - AND? OR? NOT?
- Two layers of LTUs can represent any boolean function
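One possible set of answers to the AND/OR/NOT question, using a unit that outputs 1 when its weighted input sum exceeds a threshold. The specific weights and thresholds below are just one workable choice among many, and the two-layer XOR is included only as an instance of the two-layer claim.

```python
def ltu(weights, threshold, inputs):
    """Linear threshold unit: 1 if the weighted input sum exceeds the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0

def AND(a, b):
    return ltu([1, 1], 1.5, [a, b])

def OR(a, b):
    return ltu([1, 1], 0.5, [a, b])

def NOT(a):
    return ltu([-1], -0.5, [a])

def XOR(a, b):
    # Two layers: hidden units compute OR and AND; the output unit fires
    # when OR is on and AND is off.
    return ltu([1, -1], 0.5, [OR(a, b), AND(a, b)])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```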
Learning Rules
- Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the neuron can incrementally change its weights to learn to produce these outputs using the perceptron learning rule
  - Assumes binary-valued inputs/outputs
  - Assumes a single linear threshold unit
Perceptron Learning rule
- If the target output for unit j is t_j:
    w_ji ← w_ji + η (t_j − o_j) o_i
  where η is the learning rate
- Equivalent to the intuitive rules:
  - If output is correct, don't change the weights
  - If output is low (o_j=0, t_j=1), increment weights for all the inputs which are 1
  - If output is high (o_j=1, t_j=0), decrement weights for all inputs which are 1
- Must also adjust the threshold, or equivalently assume there is a weight w_j0 for an extra input unit that has o_0=1
Perceptron Learning Algorithm
- Repeatedly iterate through examples, adjusting weights according to the perceptron learning rule until all outputs are correct:
    Initialize the weights to all zero (or random)
    Until outputs for all training examples are correct
        For each training example e do
            Compute the current output o_j
            Compare it to the target t_j and update weights
- Each execution of the outer loop is an epoch
- For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
- Q: When will the algorithm terminate?
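A compact sketch of this loop for a single unit follows. It folds the threshold into a bias weight on a constant input of 1, and it adds an epoch cap (`max_epochs`, an assumption of this sketch) so the loop also stops on data that no single unit can fit. The function name and the learning-rate parameter `eta` are illustrative.

```python
def train_perceptron(examples, eta=1.0, max_epochs=100):
    """examples: list of (inputs, target) with binary values.
    Returns (weights, epoch of convergence or None)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)          # w[0] is the weight for the constant input 1
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, t in examples:
            xs = [1] + list(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            if o != t:
                errors += 1
                # Perceptron rule: move each weight by eta * (t - o) * input.
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if errors == 0:          # all outputs correct: stop
            return w, epoch
    return w, None               # gave up after max_epochs

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_or, epochs_or = train_perceptron(OR_DATA)
w_xor, epochs_xor = train_perceptron(XOR_DATA)
print(w_or, epochs_or)
print(epochs_xor)  # None
```

On OR the loop stops after a few epochs; on XOR, which is not linearly separable, it never reaches an error-free epoch and only the cap ends it.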
Representation Limitations of a Perceptron
- Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space
Perceptron Learnability
- Perceptron Convergence Theorem: if there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)
- Unfortunately, many functions (like parity) cannot be represented by an LTU
Layered feed-forward network
[Figure: layers of output units, hidden units, and input units]
"Executing" neural networks
- Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
- Working forward through the network, the input function of each unit is applied to compute the input value
  - Usually this is just the weighted sum of the activation on the links feeding into this node
- The activation function transforms this input value into a final value
  - Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node
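The forward pass described above can be sketched as follows, assuming a fully connected layered network with a sigmoid activation and a bias term per unit. The layer format (a list of weight rows plus a list of biases) and the hand-picked weights, which make the small network behave like XOR, are assumptions for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    """layers: list of (weights, biases); weights[j][i] links unit i to unit j."""
    activations = list(inputs)
    for weights, biases in layers:
        # Input function: weighted sum of incoming activations (plus bias);
        # activation function: sigmoid.
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
                       for row, b in zip(weights, biases)]
    return activations

# Hypothetical 2-2-1 network: hidden units approximate OR and AND,
# and the output unit fires when OR is on and AND is off.
layers = [
    ([[20.0, 20.0], [20.0, 20.0]], [-10.0, -30.0]),
    ([[20.0, -20.0]], [-10.0]),
]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(layers, x)[0], 3))
```

With large weights the sigmoid saturates, so each unit behaves almost like a hard threshold.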