Statistical Learning Methods Russell and Norvig: Chapter 20 (20.1,20.2,20.4,20.5) CMSC 421 – Fall 2006.

Slides from: Doug Gray, David Poole

Statistical Approaches
Statistical Learning (20.1)
Naïve Bayes (20.2)
Instance-based Learning (20.4)
Neural Networks (20.5)

Statistical Learning (20.1)

Example: Candy Bags
Candy comes in two flavors: cherry and lime.
Candy is wrapped, so we can't tell the flavor until it is opened.
There are 5 kinds of bags of candy:
h1 = all cherry
h2 = 75% cherry, 25% lime
h3 = 50% cherry, 50% lime
h4 = 25% cherry, 75% lime
h5 = 100% lime
Given a new bag of candy, predict H.
Observations: d1, d2, d3, …

Bayesian Learning
Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e. use all the hypotheses, not just the single best one).
Now, if we want to predict some unknown quantity X:
P(X | d) = Σ_i P(X | h_i) P(h_i | d)

Bayesian Learning cont.
Calculating P(h_i | d):
P(h_i | d) = α P(d | h_i) P(h_i), where P(d | h_i) is the likelihood and P(h_i) is the prior.
Assume the observations are i.i.d. (independent and identically distributed), so that P(d | h_i) = Π_j P(d_j | h_i).

Example: Hypothesis
Prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
Data: a sequence of candies d1, d2, … drawn from the bag.
Q1: After seeing d1, what is P(h_i | d1)?
Q2: After seeing d1, what is the probability that the next candy d2 has a given flavor, P(d2 | d1)?
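A minimal Python sketch of this update for the candy example. The observed candy here (a single lime) is an assumption for illustration, since the slide's data icons are not reproduced in this transcript.

    # Bayesian update for the candy-bag example. The single observed lime candy
    # below is an assumed observation, used only to make the numbers concrete.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1) .. P(h5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i): all cherry .. all lime

    def posterior(observations):
        """P(h_i | d) proportional to P(d | h_i) P(h_i), with i.i.d. draws."""
        unnorm = []
        for prior, pl in zip(priors, p_lime):
            likelihood = 1.0
            for obs in observations:
                likelihood *= pl if obs == "lime" else (1.0 - pl)
            unnorm.append(prior * likelihood)
        z = sum(unnorm)
        return [u / z for u in unnorm]

    post = posterior(["lime"])                               # Q1 (under the assumption):
    print(post)                                              # [0.0, 0.1, 0.4, 0.3, 0.2]
    p_next_lime = sum(pl * p for pl, p in zip(p_lime, post))
    print(p_next_lime)                                       # Q2: P(d2 = lime | d1) = 0.65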

Making Statistical Inferences
Bayesian – predictions made using all hypotheses, weighted by their posterior probabilities.
MAP (maximum a posteriori) – uses only the single most probable hypothesis to make the prediction; often much easier than Bayesian, and as we get more and more data it approaches the Bayesian-optimal prediction.
ML (maximum likelihood) – MAP with a uniform prior over H; reasonable when we have no reason to prefer one hypothesis a priori.
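To make the contrast concrete, here is a small sketch comparing the three strategies on the candy example, again assuming a single observed lime candy:

    # Bayesian vs. MAP vs. ML on the candy example, assuming one observed lime candy.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h1) .. P(h5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # P(lime | h_i)
    likelihood = p_lime                           # likelihood of a single lime under each h_i

    unnorm = [p * l for p, l in zip(priors, likelihood)]
    post = [u / sum(unnorm) for u in unnorm]      # posterior P(h_i | d1)

    p_bayes = sum(pl * p for pl, p in zip(p_lime, post))   # 0.65: average over all h_i
    h_map = max(range(5), key=lambda i: post[i])            # h3: single most probable hypothesis
    h_ml = max(range(5), key=lambda i: likelihood[i])       # h5: ignores the prior entirely
    print(p_bayes, p_lime[h_map], p_lime[h_ml])             # 0.65  0.5  1.0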

Naïve Bayes (20.2)

Naïve Bayes
aka "Idiot Bayes"
A particularly simple Bayesian network that makes overly strong independence assumptions, but works surprisingly well in practice…

Bayesian Diagnosis
Suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn.
Suppose there are m boolean symptoms, E1, …, Em.
How do we make a diagnosis? By Bayes' rule we need:
P(d_i | e1, …, em) = α P(d_i) P(e1, …, em | d_i)
i.e. the prior P(d_i) and the probability of the observed symptoms given each diagnosis.

Naïve Bayes Assumption
Assume each piece of evidence (symptom) is independent given the diagnosis; then
P(e1, …, em | d_i) = Π_k P(e_k | d_i)
What is the structure of the corresponding BN?

Naïve Bayes Example
Possible diagnoses: Allergy, Cold, and Well.
Possible symptoms: Sneeze, Cough, and Fever.
A table gives P(d), P(sneeze|d), P(cough|d), and P(fever|d) for each of Well, Cold, and Allergy (values not shown in this transcript).
My symptoms are sneeze & cough; what is the diagnosis?
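Since the table's numbers did not survive in this transcript, here is an illustrative Naïve Bayes diagnosis sketch with made-up probabilities (the values in p_diag and p_symptom are assumptions, not the slide's):

    # Naive Bayes diagnosis sketch. The probabilities below are made-up placeholders;
    # the slide's actual table values are not shown in this transcript.
    p_diag = {"Well": 0.9, "Cold": 0.05, "Allergy": 0.05}
    p_symptom = {   # P(symptom present | diagnosis)
        "Well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
        "Cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
        "Allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
    }

    def diagnose(observed):
        """observed maps each symptom to True/False; returns normalized posteriors."""
        scores = {}
        for d, prior in p_diag.items():
            score = prior
            for s, present in observed.items():
                p = p_symptom[d][s]
                score *= p if present else (1.0 - p)
            scores[d] = score
        z = sum(scores.values())
        return {d: s / z for d, s in scores.items()}

    print(diagnose({"sneeze": True, "cough": True, "fever": False}))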

Learning the Probabilities
aka parameter estimation
We need:
P(d_i) – the prior
P(e_k | d_i) – the conditional probabilities
Use training data to estimate these.

Maximum Likelihood Estimate (MLE)
Use frequencies in the training set to estimate:
P(d_i) ≈ n_{d_i} / n
P(e_k | d_i) ≈ n_{d_i ∧ e_k} / n_{d_i}
where n_x is shorthand for the count of events x in the training set.

Example:
D        Sneeze   Cough   Fever
Allergy  yes      no
Well     yes      no
Allergy  yes      no      yes
Allergy  yes      no
Cold     yes
Allergy  yes      no
Well     no
Well     no
Allergy  no
Allergy  yes      no
What is: P(Allergy)? P(Sneeze|Allergy)? P(Cough|Allergy)?
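A small counting sketch in the spirit of this slide. The data list below only approximates the table (cells lost in transcription are assumed to be "no" here), so treat the printed numbers as illustrative:

    # MLE by counting. This data approximates the table above; cells that were lost
    # in transcription are assumed to be "no", so the numbers are illustrative.
    data = [  # (diagnosis, sneeze, cough, fever)
        ("Allergy", True, False, False), ("Well", True, False, False),
        ("Allergy", True, False, True),  ("Allergy", True, False, False),
        ("Cold", True, False, False),    ("Allergy", True, False, False),
        ("Well", False, False, False),   ("Well", False, False, False),
        ("Allergy", False, False, False),("Allergy", True, False, False),
    ]

    n = len(data)
    n_allergy = sum(1 for d, *_ in data if d == "Allergy")
    n_allergy_sneeze = sum(1 for d, sn, _, _ in data if d == "Allergy" and sn)
    n_allergy_cough = sum(1 for d, _, co, _ in data if d == "Allergy" and co)

    print("P(Allergy)          =", n_allergy / n)                 # 6/10
    print("P(Sneeze | Allergy) =", n_allergy_sneeze / n_allergy)  # 5/6
    print("P(Cough | Allergy)  =", n_allergy_cough / n_allergy)   # 0/6 (motivates smoothing)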

Laplace Estimate (smoothing)
Use smoothing to eliminate zeros:
P(d_i) ≈ (n_{d_i} + 1) / (N + n)
P(e_k | d_i) ≈ (n_{d_i ∧ e_k} + 1) / (n_{d_i} + 2)
where N is the total number of examples, n is the number of possible values for d, and e is assumed to have 2 possible values.
There are many other smoothing schemes…
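A minimal sketch of the add-one estimates, with the counts passed in explicitly (the function names are illustrative):

    # Add-one (Laplace) smoothing sketch.
    def smoothed_prior(count_d, total, n_values):
        # P(d_i) ~ (n_di + 1) / (N + n), where n = number of possible diagnoses
        return (count_d + 1) / (total + n_values)

    def smoothed_conditional(count_d_and_e, count_d, e_values=2):
        # P(e_k | d_i) ~ (n_{di and ek} + 1) / (n_di + 2) for a boolean symptom
        return (count_d_and_e + 1) / (count_d + e_values)

    # e.g. P(Cough | Allergy) is no longer exactly 0 even with zero co-occurrences:
    print(smoothed_conditional(0, 6))          # 1/8 = 0.125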

Comments
Generally works well despite the blanket assumption of independence.
Experiments show it is competitive with decision trees on some well-known test sets (UCI).
Handles noisy data.

Learning more complex Bayesian networks
Two subproblems:
Learning structure: combinatorial search over the space of networks.
Learning parameter values: easy if all of the variables are observed in the training set; harder if there are 'hidden variables'.

Instance-based Learning

Instance/Memory-based Learning
Non-parametric: hypothesis complexity grows with the data.
Memory-based: hypotheses are constructed directly from the training data itself.

Nearest Neighbor Methods
To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class among them.
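A minimal k-nearest-neighbor sketch using plain Euclidean distance (the training points and the value of k are made up for illustration):

    # k-nearest-neighbor classification sketch.
    import math
    from collections import Counter

    def knn_classify(x, training, k=3):
        """training is a list of (vector, label) pairs; x is a feature vector."""
        neighbors = sorted(training, key=lambda ex: math.dist(ex[0], x))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]            # majority class among the k

    # Tiny illustrative example
    train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
    print(knn_classify((1, 1), train, k=3))          # "A"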

Issues
Distance measure:
Most common: Euclidean.
Better distance measures: normalize each variable by its standard deviation.
For discrete data, can use Hamming distance (see the sketch below).
Choosing k:
Increasing k reduces variance but increases bias.
For high-dimensional spaces, the problem is that the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets. Indexing the data can help, e.g. with k-d trees.
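Two of the distance tweaks mentioned above, sketched as small helpers (names and data are illustrative):

    import statistics

    def standardize(columns):
        """Normalize each variable by its standard deviation (per-column z-score)."""
        means = [statistics.mean(c) for c in columns]
        stds = [statistics.stdev(c) for c in columns]
        return [[(v - m) / s for v, m, s in zip(row, means, stds)]
                for row in zip(*columns)]

    def hamming(a, b):
        """Number of positions at which two discrete vectors disagree."""
        return sum(x != y for x, y in zip(a, b))

    cols = [[170, 180, 165], [70_000, 90_000, 60_000]]   # wildly different scales
    print(standardize(cols))
    print(hamming(["red", "yes", "low"], ["red", "no", "high"]))   # 2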

Neural Networks (20.5)

Neural function
Brain function (thought) occurs as the result of the firing of neurons.
Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters.
Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds.
Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength.
There are about 10^11 neurons and about 10^14 synapses in the human brain.

Biology of a neuron

Brain structure
Different areas of the brain have different functions.
Some areas seem to have the same function in all humans (e.g., Broca's region); the overall layout is generally consistent.
Some areas are more plastic and vary in their function; also, the lower-level structure and function vary greatly.
We don't know how different functions are "assigned" or acquired: partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors), partly the result of experience (learning).
We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought".
Our neural networks are not nearly as complex or intricate as the actual brain structure.

Comparison of computing power
Computers are way faster than neurons…
But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel.
Neural networks are designed to be massively parallel.
The brain is effectively a billion times faster.

Neural networks
Neural networks are made up of nodes or units, connected by links.
Each link has an associated weight and activation level.
Each node has an input function (typically summing over weighted inputs), an activation function, and an output.

Neural unit

Linear Threshold Unit (LTU)

Sigmoid Unit
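The unit diagrams are not reproduced in this transcript; the sketch below shows one artificial unit with the two activation functions just named, using made-up weights and bias:

    import math

    def unit(inputs, weights, bias, activation):
        net = sum(w * x for w, x in zip(weights, inputs)) + bias   # input function
        return activation(net)                                     # activation function

    step = lambda net: 1 if net >= 0 else 0             # linear threshold unit (LTU)
    sigmoid = lambda net: 1.0 / (1.0 + math.exp(-net))  # sigmoid unit

    print(unit([1, 0], [0.5, 0.5], -0.7, step))         # 0   (net = -0.2 is below threshold)
    print(unit([1, 0], [0.5, 0.5], -0.7, sigmoid))      # ~0.45 (soft version of the same unit)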

Neural Computation
McCulloch and Pitts (1943) showed how LTUs can be used to compute logical functions.
AND? OR? NOT?
Two layers of LTUs can represent any boolean function.
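As a quick check, here are LTU weight settings that realize AND, OR, and NOT (the particular weights and thresholds are just one of many valid choices):

    # One linear threshold unit per gate.
    def ltu(inputs, weights, bias):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

    AND = lambda a, b: ltu([a, b], [1, 1], -1.5)   # fires only when both inputs are 1
    OR  = lambda a, b: ltu([a, b], [1, 1], -0.5)   # fires when at least one input is 1
    NOT = lambda a:    ltu([a],    [-1],    0.5)   # inverts its single input

    assert AND(1, 1) == 1 and AND(1, 0) == 0
    assert OR(0, 1) == 1 and OR(0, 0) == 0
    assert NOT(0) == 1 and NOT(1) == 0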

Learning Rules
Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed so that the neuron learns to produce these outputs, using the perceptron learning rule.
Assumes binary-valued inputs/outputs.
Assumes a single linear threshold unit.

Perceptron Learning Rule
If the target output for unit j is t_j, update each weight by
w_ji ← w_ji + η (t_j − o_j) x_i
Equivalent to the intuitive rules:
If the output is correct, don't change the weights.
If the output is low (o_j = 0, t_j = 1), increment the weights for all the inputs which are 1.
If the output is high (o_j = 1, t_j = 0), decrement the weights for all the inputs which are 1.
Must also adjust the threshold, or equivalently assume there is a weight w_j0 for an extra input unit that has o_0 = 1.

Perceptron Learning Algorithm
Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct:
Initialize the weights to all zero (or random).
Until the outputs for all training examples are correct, for each training example e: compute the current output o_j, compare it to the target t_j, and update the weights.
Each execution of the outer loop is an epoch.
For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold.
Q: when will the algorithm terminate? (See the sketch below.)
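A minimal sketch of the training loop, using a bias weight on a constant input in place of an explicit threshold. The learning rate, epoch cap, and the AND training set are illustrative choices; the loop stops once an epoch produces no errors, which is guaranteed to happen when the data is linearly separable (see the convergence theorem below).

    def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
        w = [0.0] * (n_inputs + 1)                   # w[0] is the bias weight on a constant input of 1
        for _ in range(max_epochs):                  # each pass over the data is an epoch
            all_correct = True
            for x, t in examples:                    # x: input tuple, t: target 0/1
                xa = [1] + list(x)                   # prepend the constant bias input
                o = 1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else 0
                if o != t:
                    all_correct = False
                    w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xa)]
            if all_correct:                          # no errors in a full epoch: done
                return w
        return w                                     # the epoch cap guards against data
                                                     # that is not linearly separable

    # Illustrative run on AND, which is linearly separable
    print(train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)], 2))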

Representational Limitations of a Perceptron
Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e. the positive and negative examples are separable by a hyperplane in n-dimensional space.

Perceptron Learnability
Perceptron Convergence Theorem: if there is a set of weights that is consistent with the training data (i.e. the data is linearly separable), the perceptron learning algorithm will converge.
Unfortunately, many functions (like parity) cannot be represented by an LTU (Minsky & Papert, 1969).

Layered feed-forward network
Figure: input units feed into hidden units, which feed into output units.

Backpropagation Algorithm
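The algorithm itself is not reproduced in this transcript. Below is a minimal, illustrative backpropagation sketch for one hidden layer of sigmoid units, trained by gradient descent on squared error; the network size, learning rate, epoch count, and the XOR training set are all assumptions for illustration.

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train(examples, n_in, n_hidden, eta=0.5, epochs=5000):
        rnd = random.Random(0)
        # small random initial weights; W1[j] and W2 carry a bias weight at index 0
        W1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        W2 = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        for _ in range(epochs):
            for x, t in examples:
                # forward pass
                xa = [1.0] + list(x)                         # inputs plus bias input
                h = [sigmoid(sum(w * xi for w, xi in zip(row, xa))) for row in W1]
                ha = [1.0] + h
                o = sigmoid(sum(w * hi for w, hi in zip(W2, ha)))
                # backward pass: error terms for the output unit and the hidden units
                delta_o = o * (1 - o) * (t - o)
                delta_h = [h[j] * (1 - h[j]) * W2[j + 1] * delta_o for j in range(n_hidden)]
                # gradient-descent weight updates
                W2 = [w + eta * delta_o * hi for w, hi in zip(W2, ha)]
                for j in range(n_hidden):
                    W1[j] = [w + eta * delta_h[j] * xi for w, xi in zip(W1[j], xa)]
        return W1, W2

    # XOR: not representable by a single perceptron; a network with a hidden layer can represent it
    xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    W1, W2 = train(xor, n_in=2, n_hidden=2)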

"Executing" neural networks
Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level.
Working forward through the network, the input function of each unit is applied to compute the input value; usually this is just the weighted sum of the activations on the links feeding into this node.
The activation function transforms this input function into a final value; typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node.
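A compact sketch of this forward execution for a small layered network (the weights are made up; each row of a weight matrix holds a bias followed by the incoming link weights):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, layers):
        """Each layer is a list of rows [bias, w1, ..., wn], one row per unit."""
        activation = list(x)                         # input units set by the "sensors"
        for W in layers:
            activation = [sigmoid(row[0] + sum(w * a for w, a in zip(row[1:], activation)))
                          for row in W]              # weighted sum, then activation function
        return activation

    # Hypothetical 2-2-1 network with made-up weights
    hidden = [[-1.0, 2.0, 2.0], [3.0, -2.0, -2.0]]
    output = [[-2.0, 3.0, 3.0]]
    print(forward([1, 0], [hidden, output]))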

Neural Nets for Face Recognition
90% accurate at learning head pose and recognizing 1 of 20 faces.

Summary: Statistical Learning Methods
Statistical inference: use the likelihood of the data and the probability of each hypothesis to predict a value for the next instance (Bayesian, MAP, ML).
Naïve Bayes
Nearest Neighbor
Neural Networks