Donald “Gödel” Rumsfeld
“Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know,” Rumsfeld said. “We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don't know we don't know.”
--Rumsfeld talking about the reported lack of WMDs in Iraq (news conference, April 2003)
Winner of the 2003 Foot in the Mouth Award. “We think we know what he means,” said Plain English Campaign spokesman John Lister. “But we don't know if we really know.”

12/2 Decisions.. Decisions
--Vote on final: in-class (16th, 2:40pm) OR take-home (will be due by the 16th)
--Clarification on HW5
--Participation survey

Learning Dimensions
What can be learned?
--Any of the boxes representing the agent's knowledge
--Action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
What feedback is available?
--Supervised, unsupervised, “reinforcement” learning
--The credit assignment problem
What prior knowledge is available?
--“Tabula rasa” (the agent's head is a blank slate) or pre-existing knowledge

Inductive Learning (Classification Learning)
Given a set of labeled examples and a space of hypotheses:
--Find the rule that underlies the labeling (so you can use it to predict future unlabeled examples)
--Tabula rasa, fully supervised
Idea:
--Loop through all hypotheses
  --Rank each hypothesis in terms of its match to the data
  --Pick the best hypothesis
Closely related to function learning or curve-fitting (regression).

A classification learning example
Predicting when Russell will wait for a table
--similar to predicting credit card fraud, or predicting when people are likely to respond to junk mail

Ranking hypotheses
A good hypothesis will have the fewest false positives (F_h+) and fewest false negatives (F_h-) [ideally, we want both to be zero].
Rank(h) = f(F_h+, F_h-)
--f depends on the domain
--in a medical domain, false negatives are penalized more
--in a junk-mailing domain, false negatives are penalized less
H1: Russell waits only in Italian restaurants
  false +ves: X10; false -ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants
  false +ves: none; false -ves: X1, X3, X4, X6, X8, X12
(A false positive: the hypothesis classifies the example as +ve, but it is actually -ve.)
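This ranking scheme can be sketched in Python. The toy examples and the two hypothesis predicates below are illustrative stand-ins, not the lecture's actual X1..X12 data, and the function names are mine:

```python
def count_errors(hypothesis, examples):
    """Return (false_positives, false_negatives) of a hypothesis on labeled data."""
    fp = sum(1 for x, label in examples if hypothesis(x) and not label)
    fn = sum(1 for x, label in examples if not hypothesis(x) and label)
    return fp, fn

def rank(hypothesis, examples, fp_weight=1.0, fn_weight=1.0):
    """Lower is better; the weights encode the domain's penalty structure
    (a medical domain would use a large fn_weight)."""
    fp, fn = count_errors(hypothesis, examples)
    return fp_weight * fp + fn_weight * fn

# Each example: ({attribute: value}, will_wait?) -- hypothetical data
examples = [
    ({"type": "italian", "price": "cheap"}, True),
    ({"type": "french", "price": "cheap"}, False),
    ({"type": "thai", "price": "expensive"}, True),
]

h1 = lambda x: x["type"] == "italian"                          # waits only in Italian restaurants
h2 = lambda x: x["type"] == "french" and x["price"] == "cheap" # waits only in cheap French ones

best = min([h1, h2], key=lambda h: rank(h, examples))
```

With symmetric weights this just minimizes total misclassifications; changing `fn_weight` shifts the ranking toward hypotheses that rarely miss positives.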

When do you know you have learned the concept well?
When you can classify all new instances (test cases) correctly, always.
Always?
--Maybe the training samples are not completely representative of the test samples
--So we go with “probably”
Correctly?
--May be impossible if the training data has noise (the teacher may make mistakes too)
--So we go with “approximately”
The goal of a learner, then, is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) ε and probability δ.
When is a learner A better than learner B?
--For the same ε, δ bounds, A needs fewer training samples than B to reach PAC.
(Learning curves)

PAC learning Note: This result only holds for finite hypothesis spaces (e.g. not valid for the space of line hypotheses!)

Inductive Learning (Classification Learning)
Given a set of labeled examples and a space of hypotheses:
--Find the rule that underlies the labeling (so you can use it to predict future unlabeled examples)
--Tabula rasa, fully supervised
Idea:
--Loop through all hypotheses
  --Rank each hypothesis in terms of its match to the data
  --Pick the best hypothesis
The main problem is that the space of hypotheses is too large:
--Given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses
--For 6 features, that is 2^64 = 18,446,744,073,709,551,616 hypotheses
Main variations:
Bias: what “sort” of rule are you looking for?
--If you are looking for only conjunctive hypotheses, there are just 3^n
Search:
--Greedy search (decision tree learner)
--Systematic search (version space learner)
--Iterative search (neural net learner)
It can be shown that the sample complexity of PAC learning is proportional to 1/ε, 1/δ, and log |H|.
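The sample-complexity claim comes from the standard bound for finite hypothesis spaces: m >= (1/ε)(ln|H| + ln(1/δ)) examples suffice for a consistent learner to be probably approximately correct. A minimal sketch (the function name and the conjunctive-hypothesis example are mine):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Standard PAC bound for a finite hypothesis space H:
    with m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples, any hypothesis
    consistent with the data has error <= epsilon with probability >= 1 - delta."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# Conjunctive hypotheses over n boolean features: |H| = 3^n, so the
# bound grows only linearly in n rather than exponentially.
n = 6
m = pac_sample_bound(3 ** n, epsilon=0.1, delta=0.05)
```

Note the log |H| dependence: restricting the bias to conjunctions (|H| = 3^n) instead of all boolean functions (|H| = 2^(2^n)) turns an exponential sample requirement into a linear one.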

IMPORTANCE OF Bias in Learning…
The “Gavagai” example: the “whole object” bias in language learning.
The more expressive the bias, the larger the hypothesis space, and the slower the learning:
--Line fitting is faster than curve fitting
--Line fitting may miss non-line patterns

Uses different biases in predicting Russell's waiting habits
Naïve Bayes (Bayes net learning)
--Examples are used to learn the topology and learn the CPTs
Neural nets
--Examples are used to learn the topology and learn the edge weights
Decision trees
--Examples are used to learn the topology and the order of questions
Association rules
--Examples are used to learn the support and confidence of association rules
  e.g. If patrons=full and day=Friday then wait (0.3/0.7)
  If wait>60 and Reservation=no then wait (0.4/0.9)

Mirror, mirror, on the wall, which learning bias is the best of all?
Well, there is no such thing, silly!
--Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:
  --A line-fitter can fit the best line to the data very fast, but won't know what to do if the data doesn't fall on a line
  --A curve-fitter can fit lines as well as curves, but takes longer to fit lines than a line-fitter
--Different types of bias classes (decision trees, NNs, etc.) provide different ways of naturally carving up the space of all possible hypotheses
So a more reasonable question is:
--What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
--Within this bias class, what is the most restrictive bias that still captures the true pattern in the data?
  --Decision trees can capture all boolean functions, but are faster at capturing conjunctive boolean functions
  --Neural nets can capture all boolean or real-valued functions, but are faster at capturing linearly separable functions
  --Bayesian learning can capture all probabilistic dependencies, but is faster at capturing single-level dependencies (naïve Bayes classifier)

12/4 Interactive review next class!!
--Minh's review: next Monday evening
--Rao's review: reading day?
--Vote on participation credit: should I consider participation credit or not?

Review
Fitting test cases vs. predicting future cases: the BIG TENSION….
Why is simple better? Why not the 3rd?

Uses different biases in predicting Russell's waiting habits
Neural nets
--Examples are used to learn the topology and learn the edge weights
Decision trees
--Examples are used to learn the topology and the order of questions
Naïve Bayes (Bayes net learning)
--Examples are used to learn the topology and learn the CPTs
Association rules
--Examples are used to learn the support and confidence of association rules
  e.g. If patrons=full and day=Friday then wait (0.3/0.7)
  If wait>60 and Reservation=no then wait (0.4/0.9)

Learning Decision Trees---How?
Basic idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes; terminate
--If all examples are -ve, label No; terminate
--If some are +ve and some are -ve, continue splitting recursively
Which one to pick?
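The recursive procedure above can be sketched as follows. This is a minimal sketch, not the lecture's implementation: the attribute choice is a placeholder (it simply takes the first remaining attribute rather than the information-gain heuristic), and the tiny AND-pattern dataset is made up for illustration:

```python
from collections import Counter

def learn_tree(examples, attributes):
    """examples: list of ({attribute: value}, bool label) pairs.
    Returns a nested dict tree, or a bool leaf."""
    labels = [label for _, label in examples]
    if all(labels):
        return True                      # all examples +ve: label Yes, terminate
    if not any(labels):
        return False                     # all examples -ve: label No, terminate
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority label
    attr = attributes[0]                 # placeholder for the info-gain choice
    remaining = [a for a in attributes if a != attr]
    tree = {"attr": attr, "branches": {}}
    for v in {x[attr] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[attr] == v]
        tree["branches"][v] = learn_tree(subset, remaining)  # split recursively
    return tree

def classify(tree, x):
    """Walk the tree by answering its attribute questions until a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

# Illustrative dataset: wait only when both hypothetical attributes are 1
data = [({"a": 1, "b": 1}, True), ({"a": 1, "b": 0}, False),
        ({"a": 0, "b": 1}, False), ({"a": 0, "b": 0}, False)]
tree = learn_tree(data, ["a", "b"])
```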

Depending on the order in which we pick attributes, we can get smaller or bigger trees.
Which tree is better? Why do you think so?

Basic idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes; terminate
--If all examples are -ve, label No; terminate
--If some are +ve and some are -ve, continue splitting recursively
--If no attributes are left to split on, label with the majority element
Would you split on Patrons or Type?

The Information Gain Computation
Given k mutually exclusive and exhaustive events E_1 … E_k whose probabilities are p_1 … p_k, the “information” content (entropy) is defined as Σ_i -p_i log2 p_i. This is the expected number of comparisons needed to tell whether a given example is +ve or -ve.
Before splitting, with N+ positive and N- negative examples:
  P+ = N+ / (N+ + N-),  P- = N- / (N+ + N-)
  I(P+, P-) = -P+ log2(P+) - P- log2(P-)
Splitting on feature f_k partitions the examples into k branches with counts (N_i+, N_i-), i = 1..k. The residual information is
  Σ_{i=1..k} [(N_i+ + N_i-) / (N+ + N-)] * I(P_i+, P_i-)
The difference between the two is the information gain. So pick the feature with the largest information gain, i.e. the smallest residual information.
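The entropy and information-gain computation can be sketched directly from these formulas; the 6+/6- split counts in the usage line are illustrative, not from the lecture's restaurant data:

```python
import math

def entropy(pos, neg):
    """I(P+, P-) = -P+ log2 P+ - P- log2 P-, with 0 log 0 taken as 0."""
    total = pos + neg
    result = 0.0
    for n in (pos, neg):
        p = n / total
        if p > 0:
            result -= p * math.log2(p)
    return result

def information_gain(pos, neg, splits):
    """splits: list of (pos_i, neg_i) counts, one pair per branch.
    Gain = entropy before the split minus the residual information."""
    total = pos + neg
    residual = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - residual

# Splitting 6+/6- examples: two pure branches and one mixed branch.
gain = information_gain(6, 6, [(4, 0), (0, 4), (2, 2)])
```

Pure branches contribute zero residual information, so splits that isolate all-+ve or all--ve groups score the highest gain, exactly as the slide's heuristic prescribes.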

A simple example
I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1/2 + 1/2 = 1
I(1, 0) = -1 log2(1) - 0 log2(0) = 0
V(M) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1
V(A) = 2/4 * I(1, 0) + 2/4 * I(0, 1) = 0
V(N) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1
So Anxious is the best attribute to split on.
Once you split on Anxious, the problem is solved.

Evaluating the Decision Trees
Learning curves: given N examples, partition them into N_tr (the training set) and N_test (the test instances).
Loop for i = 1 to |N_tr|
  Loop for N_s in subsets of N_tr of size i
    Train the learner over N_s
    Test the learned pattern over N_test and compute the accuracy (% correct)
Russell domain vs. the “majority” function (say yes if the majority of attributes are yes).
Lesson: Every bias makes something easier to learn and other things harder to learn…
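The learning-curve loop can be sketched as follows. One assumption of mine: it averages over random subsets of each size instead of enumerating all subsets (of which there are exponentially many); the `learner`/`classify` interfaces and the trivial majority-label learner in the usage line are also illustrative:

```python
import random

def learning_curve(train, test, learner, classify, trials=20, seed=0):
    """For each training-set size i = 1..|train|, train on random subsets of
    size i and return the average accuracy on the held-out test instances."""
    rng = random.Random(seed)
    curve = []
    for i in range(1, len(train) + 1):
        accs = []
        for _ in range(trials):
            subset = rng.sample(train, i)
            h = learner(subset)
            correct = sum(classify(h, x) == y for x, y in test)
            accs.append(correct / len(test))
        curve.append(sum(accs) / trials)
    return curve

# Trivial learner: always predict the majority label of the training subset.
# On a balanced test set its accuracy stays flat at 0.5, whatever the subset size.
train = [({}, True), ({}, True), ({}, True), ({}, False)]
test = [({}, True), ({}, False)]
majority = lambda subset: sum(lab for _, lab in subset) * 2 >= len(subset)
curve = learning_curve(train, test, majority, lambda h, x: h)
```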

Problems with the Info. Gain Heuristic
Feature correlation: the Costanza party problem
--No obvious solution…
Overfitting: we may look too hard for patterns where there are none
--E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day, etc.
--Solution: don't consider splitting if the information gain given by the best feature is below a minimum threshold
  --Can use the χ² test for statistical significance
  --Will also help when we have noisy samples…
We may prefer features with very high branching
--E.g. branch on the “universal time string” for the Russell restaurant example, or branch on social security number to look for patterns on who will get an A
--Solution: “gain ratio”, the ratio of the information gain with attribute A to the information content of answering the question “What is the value of A?” The denominator is smaller for attributes with smaller domains.
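The gain-ratio fix can be sketched as follows (the function names are mine, not from the lecture): the denominator is the entropy of the branch-size distribution, which grows with the number of branches:

```python
import math

def split_information(branch_sizes):
    """Information content of answering "what is the value of A?":
    the entropy of the distribution of examples across A's branches."""
    total = sum(branch_sizes)
    return -sum(c / total * math.log2(c / total) for c in branch_sizes if c > 0)

def gain_ratio(info_gain, branch_sizes):
    """Penalizes high-branching attributes by dividing by the split information."""
    return info_gain / split_information(branch_sizes)

# Same raw gain, but one attribute shatters 8 examples into singleton branches
# (like a social-security-number split) while the other splits them 2-and-2.
low_branching = gain_ratio(1.0, [4, 4])
high_branching = gain_ratio(1.0, [1] * 8)
```

A 2-way even split has split information 1 bit, while 8 singleton branches cost 3 bits, so the high-branching attribute's ratio is a third of its raw gain.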

Neural Network Learning
Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces?
Mathematically, a surface is a function
--Need a way of learning functions
--“Threshold units”

“Neural Net” is a collection of threshold units with interconnections
Feed-forward (uni-directional connections)
--Single layer
--Multi-layer
Recurrent (bi-directional connections)
--Can act as associative memory
Any linear decision surface can be represented by a single-layer neural net.
Any “continuous” decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net (using differentiable threshold units).
A threshold unit outputs 1 if w1*I1 + w2*I2 > k, and 0 otherwise.

A Threshold Unit …is sort of like a neuron.
(Figure: threshold functions, including differentiable variants; the “brain” connection)

Perceptron Networks
What happened to the “threshold”?
--It can be modeled as an extra weight with a static input: a unit with weights w1, w2 and threshold t = k is equivalent to one with weights w1, w2 plus w0 = k, a fixed extra input I0 = -1, and threshold t = 0.
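The threshold-as-weight trick combines naturally with the classic perceptron update rule; a minimal sketch, where the tiny OR training set is illustrative (OR is linearly separable, so the perceptron converges on it):

```python
def predict(weights, inputs):
    """Threshold at 0: weights[0] is w0 = k, paired with a fixed input of -1."""
    activation = sum(w * x for w, x in zip(weights, [-1.0] + inputs))
    return 1 if activation > 0 else 0

def train_perceptron(examples, n_inputs, rate=0.1, epochs=50):
    """Classic perceptron rule: w += rate * (target - prediction) * input,
    applied to the -1 threshold input exactly like any other input."""
    weights = [0.0] * (n_inputs + 1)     # w0 (the threshold) plus one weight per input
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, inputs)
            for j, x in enumerate([-1.0] + inputs):
                weights[j] += rate * error * x
    return weights

or_data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
w = train_perceptron(or_data, n_inputs=2)
```

Folding the threshold into the weight vector is what lets this single update rule adjust the decision line's offset and slope together.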

Can Perceptrons Learn All Boolean Functions? --Are all boolean functions linearly separable?

Perceptron Training in Action
A nice applet at:
Any line that separates the +ve and -ve examples is a solution
--We may want the line that is, in some sense, equidistant from the nearest +ve/-ve examples
--Need “support vector machines” for that

Comparing Perceptrons and Decision Trees on the Majority Function and the Russell Domain
(Learning-curve plots: majority function and Russell domain, perceptron vs. decision trees)
--The majority function is linearly separable; the Russell domain apparently is not.
--Encoding: one input unit per attribute; the unit takes as many distinct real values as the size of the attribute domain.