Midterm Review Rao Vemuri 16 Oct 2013

Posing a Machine Learning Problem
Experience Table
– Each row is an instance
– Each column is an attribute/feature
– The last column is a class label/output
– Mathematically, you are given a set of ordered pairs {(x, y)}, where x is a vector whose elements are the attributes or features
– The table is referred to as D, the data set
– Our goal is to build a model M (or hypothesis h)
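As a minimal sketch (not from the slides), an experience table can be held as a set of ordered pairs {(x, y)}; the attribute values below are made up for illustration:

```python
# Illustrative experience table D: each x is the feature vector (one row),
# y is the class label (the last column). Values are invented.
D = [
    ((5.1, 3.5, 1.4), "classA"),
    ((6.2, 2.9, 4.3), "classB"),
    ((5.9, 3.0, 5.1), "classB"),
]
X = [x for x, y in D]   # instances (rows)
Y = [y for x, y in D]   # class labels (last column)
```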

Types of Problems
– Classification: Given a data set D, develop a model (hypothesis) such that the model can predict the class label (last column) of a new instance not seen before
– Regression: Given a data set D, develop a model (hypothesis) such that the model can predict the (real-valued) output (last column) of a new input not seen before

Types of Problems (continued)
– Density estimation: Given a data set D, develop a model (hypothesis) such that the model can estimate the probability distribution from which the data set is drawn

Decision Trees
We talked mostly about ID3
– Entropy
– Gain in entropy (information gain)
Given an Experience Table, you must be able to decide which attribute to split on using the entropy method and build a DT (a sketch follows below).
There are other methods, such as Gini, but you are not responsible for those.
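A minimal Python sketch (not from the slides) of the entropy-based split choice used by ID3; the function and variable names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain from splitting the experience table on one attribute (ID3 criterion)."""
    total = entropy(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(y)   # partition labels by attribute value
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return total - remainder    # ID3 splits on the attribute with the largest gain
```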

Advantages of DT
– Simple to understand and easy to interpret.
– When we fit a decision tree to a training dataset, the top few nodes on which the tree splits are essentially the most important variables in the dataset, so feature selection is completed automatically.
– If we have a dataset that measures, say, revenue in millions and loan age in years, some form of normalization or scaling is required before we can fit a regression model and interpret the coefficients. Such variable transformations are not required with decision trees, because the tree structure remains the same with or without the transformation.

Disadvantages of DT
– For data that include categorical variables with different numbers of levels, information gain is biased in favor of the attributes with more levels.
– Calculations can get very complex, particularly if many values are uncertain and/or many outcomes are linked.

Mathematical Model of a Neuron
– A neuron produces an output if the weighted sum of the inputs exceeds a threshold, theta.
– For convenience, we represent the threshold as a weight w_0 connected to an input fixed at +1.
– The net input to the neuron can then be written as the dot (inner) product of the weight vector w and the input vector x. The output is f(net input).
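A minimal sketch of this model in Python (assuming numpy; the weight and input values are invented for illustration):

```python
import numpy as np

def neuron_output(w, x, f):
    """Neuron output: apply activation f to the net input w . x.

    The threshold theta is folded into w as w[0], with x[0] fixed at +1,
    so the dot product already includes the bias term.
    """
    net = np.dot(w, x)          # net input: weighted sum of inputs
    return f(net)               # output is f(net input)

# Example (illustrative values): w[0] plays the role of -theta, x[0] = +1
w = np.array([-0.5, 0.8, 0.3])
x = np.array([1.0, 0.6, 0.4])
print(neuron_output(w, x, np.sign))
```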

Perceptron
– In a Perceptron, the function f is the signum (sign) function: the output is +1 if the net input is > 0 and -1 if it is <= 0.
– Training rule: new weight = old weight + eta × (error) × input
– Error = target output – actual output = (t – y)
– NOTE: The error is always +2, -2, or 0, and weight updates occur only when the error ≠ 0.
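A minimal Python sketch (not from the slides) of one application of this training rule, assuming numpy arrays for w and x and targets in {+1, -1}:

```python
import numpy as np

def perceptron_update(w, x, t, eta=0.1):
    """One Perceptron weight update: new_w = w + eta * (t - y) * x.

    y = sign(w . x) in {+1, -1}, so the error (t - y) is always 0 or +/-2,
    and the weights change only when the example is misclassified.
    """
    y = 1 if np.dot(w, x) > 0 else -1   # signum output (net <= 0 maps to -1)
    return w + eta * (t - y) * x
```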

Adaline
– In an Adaline, the function f(x) = x; that is, the output equals the net input.
– Training rule: new weight = old weight + eta × (error) × input
– Error = target output – actual output = (t – y)

Delta Rule
– In the Delta Rule, the function f is the sigmoid function, so the output lies in [0, 1].
– Training rule: new weight = old weight + eta × (error) × input
– Error = target output – actual output = (t – y)
– NOTE: The error is a real number.
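A minimal Python sketch of the update exactly as stated on the slide (a full gradient-descent derivation for a sigmoid unit would also multiply by the sigmoid derivative; that factor is omitted here to match the slide's rule):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_update(w, x, t, eta=0.1):
    """Delta Rule update with a sigmoid unit: new_w = w + eta * (t - y) * x.

    Unlike the Perceptron, y = sigmoid(w . x) lies in [0, 1], so the
    error (t - y) is a real number rather than 0 or +/-2.
    """
    y = sigmoid(np.dot(w, x))
    return w + eta * (t - y) * x
```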

Generalized Delta Rule
– This is the Delta Rule applied to multi-layered networks.
– In multi-layered, feed-forward networks we only know the error (t – y) at the output stage, because t is given only at the output. So we can calculate weight updates at the output layer using the Delta Rule.

Weight Updates at the Hidden Level
– To calculate the weight updates at the hidden layer, we need "what the error should be" at the hidden unit(s).
– This is done by taking each output unit's error, multiplying it by the weight between that output unit and the hidden unit, and adding up the propagated values. Then the Delta Rule is applied again (see the sketch below).
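A minimal Python sketch (not from the slides) of propagating the output-layer errors back to the hidden units; for sigmoid hidden units the generalized delta rule also scales by the derivative y(1 - y). The shapes and names are assumptions for illustration:

```python
import numpy as np

def hidden_deltas(output_deltas, W_hidden_to_output, hidden_activations):
    """Back-propagate error to the hidden layer (generalized delta rule sketch).

    Each hidden unit's "error" is the weighted sum of the output-layer deltas
    it feeds into, scaled by the sigmoid derivative y * (1 - y).
    Assumed shapes: output_deltas (n_out,), W_hidden_to_output (n_out, n_hidden),
    hidden_activations (n_hidden,).
    """
    propagated = W_hidden_to_output.T @ output_deltas          # sum over output units of weight * delta
    return hidden_activations * (1 - hidden_activations) * propagated
```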

Basic Probability Formulas
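The transcript carries only this slide's title. A few standard identities that such a slide typically collects are reproduced below (not necessarily the slide's exact list):

```latex
\begin{align*}
P(A \wedge B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A) && \text{(product rule)}\\
P(A \vee B)   &= P(A) + P(B) - P(A \wedge B)           && \text{(sum rule)}\\
P(B)          &= \sum_{i} P(B \mid A_i)\,P(A_i)        && \text{(total probability, the $A_i$ mutually exclusive with $\textstyle\sum_i P(A_i)=1$)}
\end{align*}
```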

Probability for Bayes Method
– The concept of independence is central.
– In Machine Learning we are interested in determining the best hypothesis h from a set of hypotheses H, given a training data set D.
– In probability language, we want the most probable hypothesis, given the training data set D and any other information about the probabilities of the various hypotheses in H (prior probabilities).

Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes learning
– Bayesian belief network learning
– Combines prior knowledge (prior probabilities) with observed data
– Requires prior probabilities
Provides a useful conceptual framework:
– Provides a "gold standard" for evaluating other learning algorithms
– Additional insight into Occam's razor

Bayes Theorem
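The transcript carries only the slide title; the standard statement of Bayes' theorem, in the h/D notation defined on the next slide, is:

```latex
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
```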

Notation
– P(h) = initial probability (prior probability) that hypothesis h holds
– P(D) = prior probability that data D will be observed (independent of any hypothesis)
– P(D|h) = probability that data D will be observed, given that hypothesis h holds
– P(h|D) = probability that h holds, given training data D; this is called the posterior probability

Bayes Theorem for ML
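The transcript carries only the slide title. Applying Bayes' theorem to learning, the learner chooses the maximum a posteriori (MAP) hypothesis; P(D) can be dropped because it does not depend on h:

```latex
\begin{align*}
h_{MAP} &= \arg\max_{h \in H} P(h \mid D)
         = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
         = \arg\max_{h \in H} P(D \mid h)\,P(h)
\end{align*}
```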

Maximum Likelihood Hypothesis
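The transcript carries only the slide title. If every hypothesis in H is assumed equally probable a priori, the MAP hypothesis reduces to the maximum likelihood (ML) hypothesis:

```latex
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```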

Patient has Cancer or Not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) =            P(¬cancer) =
P(+|cancer) =          P(−|cancer) =
P(+|¬cancer) =         P(−|¬cancer) =

Medical Diagnosis
Two alternatives:
– Patient has cancer
– Patient has no cancer
Data: laboratory test with two outcomes
– + Positive: patient has cancer
– − Negative: patient has no cancer
Prior knowledge:
– In the population, only 0.008 have cancer
– The lab test is correct in 98% of positive cases
– The lab test is correct in 97% of negative cases

Probability Notation
P(cancer) = 0.008;   P(~cancer) = 0.992
P(+Lab|cancer) = 0.98;   P(-Lab|cancer) = 0.02
P(+Lab|~cancer) = 0.03;   P(-Lab|~cancer) = 0.97
This is the given data in probability notation. Notice that P(cancer), P(+Lab|cancer), and P(-Lab|~cancer) are actually given (shown in blue on the slide); the remaining values (in red) are inferred from them.

Brute Force MAP Hypothesis Learner
A new patient gets examined and the test says he has cancer. Does he? Doesn't he?
To find the MAP hypothesis, for each hypothesis h in H, calculate the unnormalized posterior P(D|h)P(h):
P(+Lab|cancer)P(cancer) = (0.98)(0.008) = 0.0078
P(+Lab|~cancer)P(~cancer) = (0.03)(0.992) = 0.0298
Since 0.0298 > 0.0078, the MAP hypothesis is ~cancer.
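A short Python check of this brute-force calculation, using only the values given on the slides:

```python
# Brute-force MAP: compare P(D|h)P(h) for the two hypotheses.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer      # 0.98 * 0.008 = 0.00784
score_not    = p_pos_given_not * p_not_cancer     # 0.03 * 0.992 = 0.02976

print(round(score_cancer, 4), round(score_not, 4))                  # 0.0078 0.0298
print("h_MAP:", "cancer" if score_cancer > score_not else "~cancer")  # ~cancer
```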

Posterior Probabilities
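The transcript carries only the slide title. Normalizing the two unnormalized posteriors from the previous slide gives (values rounded):

```latex
P(\text{cancer} \mid +) = \frac{0.0078}{0.0078 + 0.0298} \approx 0.21,
\qquad
P(\neg\,\text{cancer} \mid +) = \frac{0.0298}{0.0078 + 0.0298} \approx 0.79
```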

Genetic Algorithms
– I will NOT ask questions on Genetic Algorithms in the midterm examination.
– I will not ask questions on MATLAB in the examination.