A Quick Overview of Probability William W. Cohen Machine Learning 10-605.

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001). Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context.

Twelve years later….
Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a corpus of 1M English books
  – approx 80B words
  – 5-grams: 30Gb compressed, Gb uncompressed
  – Each 5-gram contains a frequency distribution over years
– Wrote code to compute
  – Pr(A,B,C,D,E | C=affect or C=effect)
  – Pr(any subset of A,…,E | any other fixed values of A,…,E with C=affect ∨ C=effect)

Tuesday’s Lecture - Review
Intro
– Who, Where, When - administrivia
– Why - motivations
– What/How - assignments, grading, …
Review - How to count and what to count
– Big-O and Omega notation, examples, …
– Costs of i/o vs computation
What sort of computations do we want to do in (large-scale) machine learning programs?
– Probability

Probability - what you need to really, really know
– Probabilities are cool
– Random variables and events
– The Axioms of Probability
– Independence, binomials, multinomials
– Conditional probabilities
– Bayes Rule
– MLE’s, smoothing, and MAPs
– The joint distribution

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C (truth table with columns A, B, C)

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob)

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob)
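Step 3 as an equation (a standard restatement, written out here because the slide's own example table did not survive extraction):

$$\sum_{\text{rows } r} P(r) = 1, \qquad \text{with } 2^M \text{ rows when all } M \text{ variables are Boolean.}$$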

Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes.
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. [Kohavi, 1996]
Number of Instances: 48,842
Number of Attributes: 14 (in UCI’s copy of the dataset); 3 (here)
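The rule the following examples rely on, stated explicitly (a standard identity rather than something shown on this slide): the probability of a logical expression E is the sum of the probabilities of the rows in which E holds.

$$P(E) = \sum_{\text{rows } r \text{ in which } E \text{ is true}} P(r)$$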

Using the Joint P(Poor Male) =

Using the Joint P(Poor) =

Probability - what you need to really, really know
– Probabilities are cool
– Random variables and events
– The Axioms of Probability
– Independence, binomials, multinomials
– Conditional probabilities
– Bayes Rule
– MLE’s, smoothing, and MAPs
– The joint distribution
– Inference

Inference with the Joint
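The identity this slide illustrates (reconstructed in its standard form, since the slide's own equation did not survive extraction):

$$P(E_1 \mid E_2) \;=\; \frac{P(E_1 \wedge E_2)}{P(E_2)} \;=\; \frac{\sum_{\text{rows matching } E_1 \wedge E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$$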

P(Male | Poor) = P(Male ∧ Poor) / P(Poor) = 0.612

Estimating the joint distribution
– Collect some data points
– Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears) / #(any row appears)
– ….
Data table:
Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…       …      …
gN      hN     wN

Estimating the joint distribution
For each combination of values r:
– Total = C[ r ] = 0
For each data row r_i:
– C[ r_i ] ++
– Total ++
The estimate of P(r_i) is C[ r_i ] / Total; e.g. r_i is “female, 40.5+, poor”
Complexity? O(n), n = total size of input data; O(2^d), d = #attributes (all binary)

Estimating the joint distribution
For each combination of values r:
– Total = C[ r ] = 0
For each data row r_i:
– C[ r_i ] ++
– Total ++
Complexity? O(n), n = total size of input data; the table has ∏_i k_i entries, k_i = arity of attribute i

Estimating the joint distribution
For each data row r_i:
– If r_i not in hash tables C, Total: insert C[ r_i ] = 0
– C[ r_i ] ++
– Total ++
Complexity? O(n), n = total size of input data; O(m), m = size of the model (the number of distinct rows actually seen)
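A minimal Python sketch of this hash-table counting scheme (function and variable names are illustrative, not from the course code):

```python
from collections import Counter

def estimate_joint(rows):
    """Estimate a joint distribution by counting how often each
    distinct row (tuple of attribute values) appears."""
    counts = Counter()   # plays the role of the hash table C[r] on the slide
    total = 0            # plays the role of Total
    for r in rows:       # r is a tuple, e.g. ("female", "40.5+", "poor")
        counts[r] += 1   # rows not seen before start at 0 automatically
        total += 1
    # P(r) is estimated as C[r] / Total
    return {r: c / total for r, c in counts.items()}

# Example usage (tiny made-up dataset):
# joint = estimate_joint([("female", "40.5+", "poor"),
#                         ("male", "40.5-", "rich"),
#                         ("female", "40.5+", "poor")])
# Space is O(m), where m = number of distinct rows actually seen,
# rather than O(2^d) over all possible value combinations.
```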

Another example….

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001). Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context.

An experiment
Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a corpus of 1M English books
  – approx 80B words
  – 5-grams: 30Gb compressed, Gb uncompressed
  – Each 5-gram contains a frequency distribution over years
– Extract all 5-grams from books published before 2000 that contain ‘effect’ or ‘affect’ in the middle position
  – about 20 “disk hours”
  – approx 100M occurrences
  – approx 50k distinct n-grams --- not big
– Wrote code to compute
  – Pr(A,B,C,D,E | C=affect or C=effect)
  – Pr(any subset of A,…,E | any other subset, C=affect ∨ C=effect)
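One way to make the last bullet concrete: given the 5-gram counts, the conditional distribution of the center word is just a ratio of counts (a standard maximum-likelihood estimate; not necessarily the exact computation in the course code).

$$\hat{P}(C=c \mid A=a, B=b, D=d, E=e) \;=\; \frac{\#(a,b,c,d,e)}{\#(a,b,\text{affect},d,e) + \#(a,b,\text{effect},d,e)}, \qquad c \in \{\text{affect}, \text{effect}\}$$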

Some of the Joint Distribution
(columns A, B, C, D, E and p; the probability values did not survive extraction)
– is the effect of the
– is the effect of a
– The effect of this
– to this effect : “
– be the effect of the
– …
– not the effect of any
– …
– does not affect the general
– does not affect the question
– any manner affect the principle

Another experiment
Extracted all affect/effect 5-grams from the old (small) Reuters corpus
– about 20k documents
– about 723 n-grams, 661 distinct
– Financial news, not novels or textbooks
Tried to predict the center word with:
– Pr(C | A=a, B=b, D=d, E=e)
– then P(C | A,B,D, C=effect ∨ affect)
– then P(C | B,D, C=effect ∨ affect)
– then P(C | B, C=effect ∨ affect)
– then P(C, C=effect ∨ affect)
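A minimal sketch of this backoff scheme, assuming the Reuters 5-grams are available as (A, B, C, D, E) tuples; the data layout and names here are hypothetical, and this is not the code that produced the numbers below:

```python
from collections import Counter, defaultdict

# Context patterns, from most to least specific; None marks a dropped position.
PATTERNS = [
    lambda a, b, d, e: (a, b, d, e),              # P(C | A,B,D,E)
    lambda a, b, d, e: (a, b, d, None),           # P(C | A,B,D)
    lambda a, b, d, e: (None, b, d, None),        # P(C | B,D)
    lambda a, b, d, e: (None, b, None, None),     # P(C | B)
    lambda a, b, d, e: (None, None, None, None),  # P(C), unconditional fallback
]

def build_counts(five_grams):
    """five_grams: iterable of (a, b, c, d, e) with c in {'affect', 'effect'}."""
    counts = defaultdict(Counter)
    for a, b, c, d, e in five_grams:
        for make in PATTERNS:
            counts[make(a, b, d, e)][c] += 1
    return counts

def predict_center(a, b, d, e, counts):
    """Try the most specific context first, backing off until one was seen."""
    for make in PATTERNS:
        dist = counts.get(make(a, b, d, e))
        if dist:
            word, n = dist.most_common(1)[0]
            return word, n / sum(dist.values())  # predicted word, estimated P(C | context)
    return None, 0.0
```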

EXAMPLES
– “The cumulative _ of the” → effect (1.0)
– “Go into _ on January” → effect (1.0)
– “From cumulative _ of accounting”: not present
  – Nor is “From cumulative _ of _”
  – But “_ cumulative _ of _” → effect (1.0)
– “Would not _ Finance Minister”: not present
  – But “_ not _ _ _” → affect (0.9625)

Performance summary

Pattern         Used  Errors
P(C|A,B,D,E)     101       1
P(C|A,B,D)       157       6
P(C|B,D)         163      13
P(C|B)           244      78
P(C)              58      31

Probability - what you need to really, really know
– Probabilities are cool
– Random variables and events
– The Axioms of Probability
– Independence, binomials, multinomials
– Conditional probabilities
– Bayes Rule
– MLE’s, smoothing, and MAPs
– The joint distribution
– Inference
– Density estimation and classification

Copyright © Andrew W. Moore
Density Estimation
Our Joint Distribution learner is our first example of something called Density Estimation.
A Density Estimator learns a mapping from a set of attribute values to a Probability.
(Diagram: Input Attributes → Density Estimator → Probability)

Copyright © Andrew W. Moore
Density Estimation
Compare it against the two other major kinds of models:
– Regressor: Input Attributes → Prediction of real-valued output
– Density Estimator: Input Attributes → Probability
– Classifier: Input Attributes → Prediction of categorical output or class (one of a few discrete values)

Copyright © Andrew W. Moore
Density Estimation → Classification
(Diagram: Input Attributes x → Density Estimator → P̂(x, y); Input Attributes x → Classifier → predicted class, one of y1, …, yk)
To classify x:
1. Use your estimator to compute P̂(x, y1), …, P̂(x, yk)
2. Return the class y* with the highest predicted probability
Ideally y* is correct with probability
$$\hat{P}(y^* \mid x) = \frac{\hat{P}(x, y^*)}{\hat{P}(x, y_1) + \cdots + \hat{P}(x, y_k)}$$
Binary case: predict POS if $\hat{P}(\mathrm{POS} \mid x) > 0.5$
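A minimal sketch of steps 1–2 as code, assuming the density estimator is simply an estimated joint table keyed by (x, y) pairs (an illustrative interface, not the code behind these slides):

```python
def classify(x, classes, joint_hat):
    """Return the class y* maximizing the estimated joint P^(x, y),
    plus the implied posterior P^(y* | x)."""
    scores = {y: joint_hat.get((x, y), 0.0) for y in classes}
    y_star = max(scores, key=scores.get)
    total = sum(scores.values())
    posterior = scores[y_star] / total if total > 0 else 0.0
    return y_star, posterior

# Binary case: predict the positive class exactly when posterior > 0.5.
```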

Classification vs Density Estimation
(side-by-side figure panels: Classification | Density Estimation)

Classification vs density estimation