Learning R&N: ch 19, ch 20. Types of Learning Supervised Learning - classification, prediction Unsupervised Learning – clustering, segmentation, pattern.

Slides:



Slides from: Doug Gray, David Poole
Presentation transcript:

Learning R&N: ch 19, ch 20

Types of Learning: Supervised Learning (classification, prediction); Unsupervised Learning (clustering, segmentation, pattern discovery); Reinforcement Learning (learning MDPs, online learning)

Supervised Learning: Someone gives you a bunch of examples, telling you what each one is. Eventually, you figure out the mapping from properties (features) of the examples to their type (classification).

Supervised Learning: Logic-based approaches learn a function f(X) → {true, false}. Statistical/probabilistic approaches learn a probability distribution p(Y|X).

Outline
Logic-based Approaches: Decision Trees (lecture 24); Version Spaces (today); PAC learning (we won’t cover)
Instance-based approaches
Statistical Approaches

Predicate-Learning Methods: Version space. Explicit representation of the hypothesis space H. Need to provide H with some “structure”.

Version Spaces
The “version space” is the set of all hypotheses that are consistent with the training instances processed so far.
An algorithm:
  V := H    ;; the version space V is ALL hypotheses H
  For each example e:
    Eliminate any member of V that disagrees with e
    If V is empty, FAIL
  Return V as the set of consistent hypotheses

Version Spaces: The Problem. PROBLEM: V is huge!! Suppose you have N attributes, each with k possible values, and you allow a hypothesis to be any disjunction of instances. There are k^N possible instances, so |H| = 2^(k^N). If N=5 and k=2, |H| = 2^32!!

Version Spaces: The Tricks
First Trick: Don’t allow arbitrary disjunctions. Organize the feature values into a hierarchy of allowed disjunctions, e.g. any-color at the top, pale (yellow, white) and dark (blue, black) below it. Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed).
Second Trick: Define a partial ordering on H (“general to specific”) and only keep track of the upper bound and lower bound of the version space.
RESULT: An incremental, possibly efficient algorithm!

Rewarded Card Example
(r=1) v … v (r=10) v (r=J) v (r=Q) v (r=K) ⇔ ANY-RANK(r)
(r=1) v … v (r=10) ⇔ NUM(r)
(r=J) v (r=Q) v (r=K) ⇔ FACE(r)
(s=♠) v (s=♣) v (s=♦) v (s=♥) ⇔ ANY-SUIT(s)
(s=♠) v (s=♣) ⇔ BLACK(s)
(s=♦) v (s=♥) ⇔ RED(s)
A hypothesis is any sentence of the form R(r) ∧ S(s) ⇒ IN-CLASS([r,s]), where R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j) for a specific rank j, and S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k) for a specific suit k.

Simplified Representation: For simplicity, we represent a concept by rs, with r ∈ {a, n, f, 1, …, 10, j, q, k} and s ∈ {a, b, r, ♠, ♣, ♦, ♥}. For example, n♠ represents NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s]), and aa represents ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s]).

Extension of a Hypothesis: The extension of a hypothesis h is the set of objects that satisfy h. Examples: the extension of f♥ is {j♥, q♥, k♥}; the extension of aa is the set of all cards.

More General/Specific Relation: Let h1 and h2 be two hypotheses in H. h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2. Examples: aa is more general than f♦; f♥ is more general than q♥; fr and nr are not comparable.

More General/Specific Relation (cont.): The inverse of the “more general” relation is the “more specific” relation. The “more general” relation defines a partial ordering on the hypotheses in H.
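
The extension and "more general" definitions above can be checked mechanically. The following is a minimal sketch, assuming a hypothesis is encoded as a (rank concept, suit concept) pair of strings; that encoding and the names `extension` and `more_general` are illustrative choices, not part of the slides.

```python
from itertools import product

RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
SUITS = ["♠", "♣", "♦", "♥"]

# Abstract concepts from the card example: a = any, n = num, f = face; b = black, r = red.
RANK_SETS = {"a": set(RANKS), "n": set(RANKS[:10]), "f": {"j", "q", "k"}}
SUIT_SETS = {"a": set(SUITS), "b": {"♠", "♣"}, "r": {"♦", "♥"}}


def extension(hyp):
    """Return the set of cards (rank, suit) that satisfy hypothesis (r, s)."""
    r, s = hyp
    ranks = RANK_SETS.get(r, {r})  # a specific rank stands for itself
    suits = SUIT_SETS.get(s, {s})  # a specific suit stands for itself
    return set(product(ranks, suits))


def more_general(h1, h2):
    """h1 is more general than h2 iff ext(h1) is a proper superset of ext(h2)."""
    return extension(h1) > extension(h2)


print(more_general(("a", "a"), ("f", "♦")))  # aa versus a face-card hypothesis
print(more_general(("f", "r"), ("n", "r")))  # fr versus nr
print(more_general(("n", "r"), ("f", "r")))  # nr versus fr
```

Because fr and nr have disjoint extensions, neither call returns true for that pair, which is what "not comparable" means in a partial order.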

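Example: Subset of Partial Order [Figure: partial order over the hypotheses aa, na, ab, a♣, nb, n♣, 4a, 4b, 4♣, from most general (aa) to most specific (4♣)]

Construction of Ordering Relation [Figure: rank hierarchy with a above n and f, n above 1 … 10, and f above j, q, k; suit hierarchy with a above b and r, b above ♠ and ♣, and r above ♦ and ♥]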

G-Boundary / S-Boundary of V: A hypothesis in V is most general iff no hypothesis in V is more general. G-boundary G of V: the set of most general hypotheses in V. A hypothesis in V is most specific iff no hypothesis in V is more specific. S-boundary S of V: the set of most specific hypotheses in V.

aa naab nb nn 44 4b aa 4a Example: G-/S-Boundaries of V aa 44 11 k …… Now suppose that 4  is given as a positive example S G We replace every hypothesis in S whose extension does not contain 4  by its generalization set

Example: G-/S-Boundaries of V: Here, both G and S have size 1. This is not the case in general! [Figure: partial order over aa, na, ab, a♣, nb, n♣, 4a, 4b, 4♣]

Example: G-/S-Boundaries of V: Let 7♣ be the next (positive) example. The generalization set of a hypothesis h is the set of the hypotheses that are immediately more general than h. [Figure: the generalization set of 4♣ highlighted in the partial order]

Example: G-/S-Boundaries of V aa naab nb nn aa Let 5 be the next (negative) example Specialization set of aa

Example: G-/S-Boundaries of V ab nb nn aa G and S, and all hypotheses in between form exactly the version space

Example: G-/S-Boundaries of V ab nb nn aa Do 8 , 6 , j  satisfy CONCEPT? Yes No Maybe At this stage …

Example: G-/S-Boundaries of V ab nb nn aa Let 2  be the next (positive) example

Example: G-/S-Boundaries of V ab nb Let j  be the next (negative) example

Example: G-/S-Boundaries of V nb + 4  7  2  – 5 j  NUM(r)  BLACK(s)  IN-CLASS([r,s])

Example: G-/S-Boundaries of V ab nb nn aa … and let 8  be the next (negative) example Let us return to the version space … The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

Example: G-/S-Boundaries of V ab nb nn aa … and let j be the next (positive) example Let us return to the version space … The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure

POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x, using the generalization sets of the hypotheses
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)

NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not agree with x
2. Minimally specialize all hypotheses in G until they are consistent with x
3. Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
4. Remove from G every hypothesis that is more specific than another hypothesis in G
5. Return (G,S)
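
As a cross-check on the card example, the version space can also be computed by brute force. The sketch below is not the incremental POSITIVE-UPDATE/NEGATIVE-UPDATE procedure: it enumerates every hypothesis, discards the inconsistent ones, and reads G and S off the survivors. The tuple encoding, the function names, and the specific suits of the example cards are assumptions (the suit symbols are not legible in this transcript).

```python
from itertools import product

RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
SUITS = ["♠", "♣", "♦", "♥"]
RANK_SETS = {"a": set(RANKS), "n": set(RANKS[:10]), "f": {"j", "q", "k"}}
SUIT_SETS = {"a": set(SUITS), "b": {"♠", "♣"}, "r": {"♦", "♥"}}

# Hypothesis space H: every (rank concept, suit concept) pair, 16 * 7 = 112 in total.
H = list(product(list(RANK_SETS) + RANKS, list(SUIT_SETS) + SUITS))


def covers(hyp, card):
    """True if the card (rank, suit) lies in the extension of hyp."""
    r, s = hyp
    return card[0] in RANK_SETS.get(r, {r}) and card[1] in SUIT_SETS.get(s, {s})


def extension(hyp):
    return frozenset(c for c in product(RANKS, SUITS) if covers(hyp, c))


def version_space(examples):
    """All hypotheses that agree with every (card, label) example."""
    return [h for h in H if all(covers(h, c) == label for c, label in examples)]


def boundaries(V):
    """G: members of V with no more general member; S: no more specific member."""
    ext = {h: extension(h) for h in V}
    G = [h for h in V if not any(ext[g] > ext[h] for g in V)]
    S = [h for h in V if not any(ext[g] < ext[h] for g in V)]
    return G, S


# Assumed suits: positives 4♣, 7♣, 2♠; negatives 5♥, j♠.
examples = [(("4", "♣"), True), (("7", "♣"), True), (("5", "♥"), False),
            (("2", "♠"), True), (("j", "♠"), False)]

for i in range(1, len(examples) + 1):
    V = version_space(examples[:i])
    G, S = boundaries(V)
    print(f"after {i} examples: |V| = {len(V)}, G = {G}, S = {S}")
```

With these five examples the last line reports a single surviving hypothesis, ('n', 'b'), in both G and S, which agrees with the NUM(r) ∧ BLACK(s) result of the walkthrough. Enumeration is only feasible here because H has 112 members; the boundary-set updates exist precisely to avoid it.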

Example-Selection Strategy (aka Active Learning): Suppose that at each step the learning procedure can select the object (card) of the next example. Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses. Then a single hypothesis will be isolated in O(log |H|) steps.

Example: which card should be queried next: 9♣? j♥? j♠? [Figure: partial order over aa, na, ab, a♣, nb, n♣]

Example-Selection Strategy (cont.): But picking the object that eliminates half the version space may be expensive.

Noise: If some examples are misclassified, the version space may collapse. Possible solution: maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc.

Version Spaces: Useful for understanding logical (consistency-based) approaches to learning. Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses). Don’t handle noise well.

VSL vs DTL: Decision tree learning (DTL) is more efficient if all examples are given in advance; otherwise, it may produce successive hypotheses, each poorly related to the previous one. Version space learning (VSL) is incremental. DTL can produce simplified hypotheses that do not agree with all examples. DTL has been more widely used in practice.

Statistical Approaches: Instance-based Learning (20.4); Statistical Learning (20.1); Neural Networks (20.5)

Nearest Neighbor Methods: To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class. [Figure: neighborhoods of x for k=1 and k=6]

Issues
Distance measure: the most common is Euclidean; a better distance measure normalizes each variable by its standard deviation.
Choosing k: increasing k reduces variance and increases bias.
In high-dimensional spaces, the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification. This can be prohibitive for large data sets. Indexing the data can help, for example with k-d trees.
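
A minimal sketch of k-nearest-neighbor classification with the standard-deviation normalization mentioned above. The function names and the tiny data set are made up for illustration; the linear scan over the training set is exactly the "pass through the data" cost the slide warns about.

```python
import math
from collections import Counter


def column_stds(rows):
    """Standard deviation of each feature column (1.0 if a column is constant)."""
    stds = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        stds.append(math.sqrt(var) or 1.0)
    return stds


def knn_classify(train_x, train_y, x, k):
    """Return the majority class among the k training points closest to x."""
    stds = column_stds(train_x)

    def dist(a, b):
        # Euclidean distance after dividing each variable by its standard deviation.
        return math.sqrt(sum(((ai - bi) / s) ** 2 for ai, bi, s in zip(a, b, stds)))

    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: two features on very different scales.
train_x = [(1.0, 100.0), (1.2, 110.0), (0.9, 95.0),
           (3.0, 300.0), (3.2, 310.0), (2.9, 290.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_x, train_y, (1.1, 105.0), k=3))
print(knn_classify(train_x, train_y, (3.1, 305.0), k=3))
```

Without the normalization, the second feature would dominate the distance simply because its numeric range is larger.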

Bayesian Learning

Example: Candy Bags
Candy comes in two flavors: cherry and lime. Candy is wrapped, so you can’t tell which flavor it is until it is opened.
There are 5 kinds of bags of candy: H1 = all cherry; H2 = 75% cherry, 25% lime; H3 = 50% cherry, 50% lime; H4 = 25% cherry, 75% lime; H5 = 100% lime.
Given a new bag of candy, predict H. Observations: D1, D2, D3, …

Bayesian Learning: Calculate the probability of each hypothesis, given the data, and make predictions weighted by this probability (i.e., use all the hypotheses, not just the single best). To predict some unknown quantity X: P(X|d) = Σ_i P(X|h_i) P(h_i|d)

Bayesian Learning cont.: Calculating P(h|d): P(h|d) = α P(d|h) P(h), where P(d|h) is the likelihood, P(h) is the prior, and α is a normalizing constant. Assume the observations are i.i.d. (independent and identically distributed), so that P(d|h) = Π_j P(d_j|h).

Example: The hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}. Data: [Figure: observed candies] Q1: After seeing d1, what is P(h_i|d1)? Q2: After seeing d1, what is the probability that d2 has a given flavor, P(d2|d1)?
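
The two questions can be worked through numerically. This is a sketch under one loud assumption: the observed flavor of d1 is not legible in the transcript, so the code takes d1 to be lime. The names `posterior` and `predict_lime` are illustrative.

```python
# Prior over h1..h5 and the fraction of lime candies under each hypothesis.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]


def posterior(prior, observations):
    """P(h_i | d) for i.i.d. observations, each either 'lime' or 'cherry'."""
    post = list(prior)
    for obs in observations:
        likelihood = [p if obs == "lime" else 1.0 - p for p in p_lime]
        post = [l * q for l, q in zip(likelihood, post)]
        z = sum(post)  # normalizing constant (1 / alpha)
        post = [q / z for q in post]
    return post


def predict_lime(post):
    """P(next candy is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * q for p, q in zip(p_lime, post))


post = posterior(prior, ["lime"])  # ASSUMPTION: d1 = lime
print(post)
print(predict_lime(post))
```

Under that assumption the posterior comes out to approximately [0, 0.1, 0.4, 0.3, 0.2] and the predictive probability of a second lime to approximately 0.65; a cherry observation would mirror these values.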

Statistical Learning Approaches: Bayesian; MAP (maximum a posteriori) **; MLE (maximum likelihood estimate; assumes a uniform prior over H)

Naïve Bayes: aka Idiot Bayes. A particularly simple BN. Makes overly strong independence assumptions but works surprisingly well in practice…

Bayesian Diagnosis: Suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn. Suppose there are m boolean symptoms, E1, …, Em. How do we make a diagnosis? We need P(d_i | e_1, …, e_m) for each diagnosis d_i, which by Bayes’ rule is P(d_i) P(e_1, …, e_m | d_i) / P(e_1, …, e_m).

Naïve Bayes Assumption: Assume each piece of evidence (symptom) is independent given the diagnosis. Then what is the structure of the corresponding BN?

Naïve Bayes Example: Possible diagnoses: Allergy, Cold and OK (Well). Possible symptoms: Sneeze, Cough and Fever. [Table: P(d), P(sneeze|d), P(cough|d) and P(fever|d) for d = Well, Cold, Allergy; the numeric entries are missing from this transcript] My symptoms are sneeze and cough; what is the diagnosis?

Learning the Probabilities: aka parameter estimation. We need P(d_i) (the prior) and P(e_k|d_i) (the conditional probability). Use training data to estimate them.

Maximum Likelihood Estimate (MLE): Use frequencies in the training set to estimate P(d_i) = n_{d_i} / N and P(e_k|d_i) = n_{d_i, e_k} / n_{d_i}, where n_x is shorthand for the count of events x in the training set and N is the total number of training examples.

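Example: [Table: ten training examples with columns D, Sneeze, Cough, Fever; the diagnoses are, in order, Allergy, Well, Allergy, Allergy, Cold, Allergy, Well, Well, Allergy, Allergy; the yes/no entries are only partly preserved in this transcript] What is P(Allergy)? P(Sneeze|Allergy)? P(Cough|Allergy)?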

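Laplace Estimate (smoothing): Use smoothing to eliminate zeros: P(d_i) = (n_{d_i} + 1) / (N + n) and P(e_k|d_i) = (n_{d_i, e_k} + 1) / (n_{d_i} + 2), where n is the number of possible values for d, N is the total number of training examples, and e is assumed to have 2 possible values. There are many other smoothing schemes…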
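
The frequency-based and Laplace-smoothed estimates can be sketched in a few lines. The training rows below are hypothetical (the table in this transcript is not fully legible), and the function names are illustrative; the add-one form with denominators N + n and n_d + 2 follows the description of n and the two-valued e above.

```python
from collections import Counter

# HYPOTHETICAL training rows: (diagnosis, {symptom: present?}). Not the slide's table.
data = [
    ("Allergy", {"sneeze": True, "cough": False, "fever": False}),
    ("Well", {"sneeze": False, "cough": False, "fever": False}),
    ("Cold", {"sneeze": True, "cough": True, "fever": True}),
    ("Allergy", {"sneeze": True, "cough": True, "fever": False}),
]


def estimate(data, smooth=False):
    """Return (prior, cond) using MLE counts, or add-one smoothing if smooth=True."""
    diagnoses = sorted({d for d, _ in data})
    symptoms = sorted(data[0][1])
    n_d = Counter(d for d, _ in data)
    a = 1 if smooth else 0
    prior = {d: (n_d[d] + a) / (len(data) + a * len(diagnoses)) for d in diagnoses}
    cond = {}
    for d in diagnoses:
        cond[d] = {}
        for e in symptoms:
            n_de = sum(1 for dd, ev in data if dd == d and ev[e])
            cond[d][e] = (n_de + a) / (n_d[d] + 2 * a)  # e has 2 possible values
    return prior, cond


def classify(prior, cond, observed):
    """argmax over d of P(d) * prod_k P(e_k | d), for observed {symptom: bool}."""
    def score(d):
        s = prior[d]
        for e, present in observed.items():
            s *= cond[d][e] if present else 1.0 - cond[d][e]
        return s
    return max(prior, key=score)


mle_prior, mle_cond = estimate(data)
lap_prior, lap_cond = estimate(data, smooth=True)
print(mle_cond["Well"]["cough"], lap_cond["Well"]["cough"])
print(classify(lap_prior, lap_cond, {"sneeze": True, "cough": True}))
```

The first print shows the point of smoothing: the unsmoothed estimate for a symptom never seen with a diagnosis is exactly zero, which would veto that diagnosis regardless of the other evidence.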

Comments: Generally works well despite the blanket assumption of independence. Experiments show it is competitive with decision trees on some well-known test sets (UCI). Handles noisy data.

Learning more complex Bayesian networks: Two subproblems. Learning structure: combinatorial search over the space of networks. Learning parameter values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’.

Neural Networks

Neural function: Brain function (thought) occurs as the result of the firing of neurons. Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters. Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds. Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength. There are about 10^11 neurons and about 10^14 synapses in the human brain.

Biology of a neuron

Brain structure: Different areas of the brain have different functions. Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent. Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly. We don’t know how different functions are “assigned” or acquired: partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors), partly the result of experience (learning). We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”. Our neural networks are not nearly as complex or intricate as the actual brain structure.

Comparison of computing power: Computers are way faster than neurons… But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel. Neural networks are designed to be massively parallel. The brain is effectively a billion times faster.

Neural networks: Neural networks are made up of nodes or units, connected by links. Each link has an associated weight and activation level. Each node has an input function (typically summing over weighted inputs), an activation function, and an output.

Neural unit

Model Neuron: A neuron is modeled as a unit j. The weight on the link from input unit i to unit j is w_ji. The net input to unit j is net_j = Σ_i w_ji o_i. With threshold T_j, the output o_j is 1 if net_j > T_j, and 0 otherwise.

Neural Computation: McCulloch and Pitts (1943) showed how linear threshold units (LTUs) can be used to compute logical functions. AND? OR? NOT? Two layers of LTUs can represent any boolean function.
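
One possible answer to the AND / OR / NOT questions, as a sketch: the weights and thresholds below are one choice among many and are not taken from the slides. The unit follows the model neuron's rule, output 1 if the net input exceeds the threshold.

```python
def ltu(weights, threshold, inputs):
    """Linear threshold unit: 1 if the weighted input sum exceeds the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0


def AND(x, y):
    return ltu([1, 1], 1.5, [x, y])


def OR(x, y):
    return ltu([1, 1], 0.5, [x, y])


def NOT(x):
    return ltu([-1], -0.5, [x])


def NAND(x, y):
    return ltu([-1, -1], -1.5, [x, y])


def XOR(x, y):
    """Two layers of LTUs: hidden units OR and NAND feed an AND output unit."""
    return AND(OR(x, y), NAND(x, y))


for x in (0, 1):
    for y in (0, 1):
        print(x, y, AND(x, y), OR(x, y), XOR(x, y))
```

XOR is the two-input parity function, so the last definition also illustrates how a second layer gets around the single-unit limitation.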

Learning Rules: Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be changed incrementally to learn to produce these outputs, using the perceptron learning rule. This assumes binary-valued inputs/outputs and a single linear threshold unit.

Perceptron Learning rule: If the target output for unit j is t_j, then w_ji ← w_ji + η (t_j − o_j) o_i, where η is a learning rate. This is equivalent to the intuitive rules: if the output is correct, don’t change the weights; if the output is low (o_j = 0, t_j = 1), increment the weights for all the inputs which are 1; if the output is high (o_j = 1, t_j = 0), decrement the weights for all inputs which are 1. The threshold must also be adjusted; or, equivalently, assume there is a weight w_j0 for an extra input unit that has o_0 = 1.

Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting weights according to the perceptron learning rule until all outputs are correct:
  Initialize the weights to all zero (or random)
  Until outputs for all training examples are correct:
    For each training example e do:
      compute the current output o_j
      compare it to the target t_j and update the weights
Each execution of the outer loop is an epoch.
For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold.
Q: When will the algorithm terminate?
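
The algorithm above can be sketched for a single unit as follows. The threshold is folded into a bias weight on a constant input of 1, as the learning-rule slide suggests; the learning rate `eta`, the epoch cap `max_epochs`, and the function name are assumptions added so the loop always returns.

```python
def train_perceptron(examples, eta=1.0, max_epochs=100):
    """Train one linear threshold unit; returns (weights, epochs) or (weights, None)."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the weight for the constant input o_0 = 1
    for epoch in range(max_epochs):
        errors = 0
        for x, t in examples:
            inputs = [1.0] + list(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, inputs)) > 0 else 0
            if o != t:
                errors += 1
                # Perceptron rule: w_i <- w_i + eta * (t - o) * input_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, inputs)]
        if errors == 0:
            return w, epoch + 1  # a full epoch with every output correct
    return w, None  # no error-free epoch within max_epochs


def predict(w, x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, epochs_and = train_perceptron(and_data)
w_xor, epochs_xor = train_perceptron(xor_data)
print(w_and, epochs_and)
print(epochs_xor)
```

This bears on the termination question: on the linearly separable AND data the loop reaches an error-free epoch, while on XOR no weight vector classifies all four examples, so only the epoch cap stops it.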

Representation Limitations of a Perceptron: Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space.

Perceptron Learnability: Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969). Unfortunately, many functions (like parity) cannot be represented by an LTU.

Layered feed-forward network [Figure: network diagram with layers of output units, hidden units, and input units]

“Executing” neural networks: Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level. Working forward through the network, the input function of each unit is applied to compute the input value; usually this is just the weighted sum of the activations on the links feeding into this node. The activation function transforms this input value into a final value; typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node.
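
The forward pass described above can be sketched as follows. The layer representation (a list of (weights, bias) pairs per layer) and the hand-picked weights are assumptions for illustration; the weights are chosen large so that each sigmoid unit behaves almost like a threshold unit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, inputs):
    """Propagate activations layer by layer; each unit is (weights, bias)."""
    activations = list(inputs)
    for layer in layers:
        activations = [
            sigmoid(bias + sum(w * a for w, a in zip(weights, activations)))
            for weights, bias in layer
        ]
    return activations


# Hand-set network with one hidden layer that approximates XOR (weights are assumed).
xor_net = [
    [([20.0, 20.0], -10.0), ([-20.0, -20.0], 30.0)],  # hidden units: ~OR and ~NAND
    [([20.0, 20.0], -30.0)],                          # output unit: ~AND of the two
]

for x in (0, 1):
    for y in (0, 1):
        print(x, y, round(forward(xor_net, [x, y])[0], 4))
```

The bias plays the role of a negated threshold, so the same code covers both the "weighted sum" input function and the sigmoid activation named in the slide.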