Essential Probability & Statistics

Essential Probability & Statistics (Lecture for CS510 Advanced Topics in Information Retrieval)
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

Prob/Statistics & Text Management
- Probability & statistics provide a principled way to quantify the uncertainties associated with natural language.
- They allow us to answer questions like:
  - Given that we observe "baseball" three times and "game" once in a news article, how likely is it to be about "sports"? (text categorization, information retrieval)
  - Given that a user is interested in sports news, how likely is the user to use "baseball" in a query? (information retrieval)

Basic Concepts in Probability
- Random experiment: an experiment with an uncertain outcome (e.g., tossing a coin, picking a word from text)
- Sample space: the set of all possible outcomes, e.g., tossing 2 fair coins, S = {HH, HT, TH, TT}
- Event: E ⊆ S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face); impossible event ({}), certain event (S)
- Probability of an event: 0 ≤ P(E) ≤ 1, such that P(S) = 1 (the outcome is always in S) and P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (e.g., A = same face, B = different face)
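A minimal Python sketch of these definitions (not from the lecture): it enumerates the two-coin sample space from the slide and checks the probability axioms on a couple of events. The variable names and the use of exact fractions are my own choices.

```python
# A minimal sketch: the two-coin sample space and event probabilities
# computed directly from the definition P(E) = sum of outcome probabilities.
from fractions import Fraction

S = ["HH", "HT", "TH", "TT"]                      # sample space for two fair coins
P = {outcome: Fraction(1, 4) for outcome in S}    # equally likely outcomes

def prob(event):
    """P(E) = sum of probabilities of the outcomes in E."""
    return sum(P[o] for o in event)

all_heads = {"HH"}                 # E = {HH}
same_face = {"HH", "TT"}           # E = {HH, TT}
diff_face = {"HT", "TH"}           # complement of same_face within S

print(prob(all_heads))             # 1/4
print(prob(same_face))             # 1/2
# Additivity for disjoint events: P(same ∪ diff) = P(same) + P(diff) = 1
print(prob(same_face | diff_face) == prob(same_face) + prob(diff_face))  # True
```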

Basic Concepts of Prob. (cont.)
- Conditional probability: P(B|A) = P(A ∩ B) / P(A)
- P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B), so P(A|B) = P(B|A) P(A) / P(B) (Bayes' Rule)
- For independent events, P(A ∩ B) = P(A) P(B), so P(A|B) = P(A)
- Total probability: if A1, …, An form a partition of S, then P(B) = P(B ∩ S) = P(B ∩ A1) + … + P(B ∩ An) (why?)
- So, P(Ai|B) = P(B|Ai) P(Ai) / P(B) = P(B|Ai) P(Ai) / [P(B|A1) P(A1) + … + P(B|An) P(An)]
- This allows us to compute P(Ai|B) based on P(B|Ai)
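A small numerical sketch of total probability and Bayes' rule; the partition {A1, A2, A3} and all numbers are made up for illustration.

```python
# Total probability and Bayes' rule over a partition A1, ..., An of S
# (illustrative numbers, not from the lecture).
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}          # P(Ai), sums to 1
likelihoods = {"A1": 0.9, "A2": 0.5, "A3": 0.1}     # P(B | Ai)

# Total probability: P(B) = sum_i P(B | Ai) P(Ai)
p_B = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes' rule: P(Ai | B) = P(B | Ai) P(Ai) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_B for a in priors}

print(p_B)          # 0.62
print(posteriors)   # e.g. P(A1|B) ≈ 0.726
```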

Interpretation of Bayes' Rule
- Hypothesis space: H = {H1, …, Hn}; Evidence: E
- P(Hi|E) = P(E|Hi) P(Hi) / P(E), where P(Hi|E) is the posterior probability of Hi, P(Hi) is the prior probability of Hi, and P(E|Hi) is the likelihood of the data/evidence if Hi is true
- If we want to pick the most likely hypothesis H*, we can drop P(E)
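A sketch of the "drop P(E)" observation: for fixed evidence E, the hypothesis maximizing P(Hi|E) is the one maximizing P(E|Hi) P(Hi), since P(E) is a constant shared by all hypotheses. Hypothesis names and numbers are hypothetical.

```python
# Picking H* = argmax_i P(Hi|E) without ever computing P(E).
prior = {"H1": 0.6, "H2": 0.3, "H3": 0.1}        # P(Hi)
likelihood = {"H1": 0.2, "H2": 0.7, "H3": 0.9}   # P(E | Hi)

# Unnormalized posterior scores P(E|Hi) * P(Hi); the missing 1/P(E)
# factor is the same for every Hi, so it cannot change the argmax.
scores = {h: likelihood[h] * prior[h] for h in prior}
h_star = max(scores, key=scores.get)             # most likely hypothesis given E
print(h_star, scores)                            # H2, since 0.21 > 0.12 > 0.09
```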

Random Variable
- X: S → ℝ (a "measure" of the outcome), e.g., number of heads, all same face?, …
- Events can be defined according to X: E(X = a) = {si | X(si) = a}, E(X ≥ a) = {si | X(si) ≥ a}
- So, probabilities can be defined on X: P(X = a) = P(E(X = a)), P(a ≤ X) = P(E(a ≤ X))
- Discrete vs. continuous random variables (think of "partitioning the sample space")
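A sketch treating the random variable as an ordinary function X: S → ℝ (here, the number of heads on two fair coins) and computing P(X = a) through the induced events; the coin example extends the earlier one and is not from the slide.

```python
# A random variable as a function X: S -> R; events and probabilities induced by X.
from fractions import Fraction

S = ["HH", "HT", "TH", "TT"]
P = {o: Fraction(1, 4) for o in S}

def X(s):                          # random variable: "measure" of the outcome
    return s.count("H")

def event_X_equals(a):             # E(X=a) = {s in S | X(s) = a}
    return {s for s in S if X(s) == a}

def prob_X_equals(a):              # P(X=a) = P(E(X=a))
    return sum(P[s] for s in event_X_equals(a))

print(prob_X_equals(0), prob_X_equals(1), prob_X_equals(2))   # 1/4 1/2 1/4
```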

An Example: Doc Classification
- Sample space: S = {x1, …, xn} (for 3 topics and four words, n = ?)

  Topic         the  computer  game  baseball
  X1: sport      1      0        1       1
  X2: sport      1      1        1       1
  X3: computer   1      1        0       0
  X4: computer   1      1        1       0
  X5: other      0      0        1       1
  …

- Events: Esport = {xi | topic(xi) = "sport"}, Ebaseball = {xi | baseball(xi) = 1}, Ebaseball,computer = {xi | baseball(xi) = 1 & computer(xi) = 0}
- Conditional probabilities: P(Esport | Ebaseball), P(Ebaseball | Esport), P(Esport | Ebaseball,computer), …
- Thinking in terms of random variables: Topic T ∈ {"sport", "computer", "other"}, "Baseball" B ∈ {0, 1}, …; then P(T="sport" | B=1), P(B=1 | T="sport"), …
- An inference problem: suppose we observe that "baseball" is mentioned; how likely is the topic to be "sport"?
  P(T="sport" | B=1) ∝ P(B=1 | T="sport") P(T="sport")
  But P(B=1 | T="sport") = ?  P(T="sport") = ?
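A sketch of this inference on the toy table above, with the probabilities taken as simple relative frequencies over the five documents; the column encoding is my own.

```python
# Toy document collection: (topic, the, computer, game, baseball) per document.
docs = [
    ("sport",     1, 0, 1, 1),   # X1
    ("sport",     1, 1, 1, 1),   # X2
    ("computer",  1, 1, 0, 0),   # X3
    ("computer",  1, 1, 1, 0),   # X4
    ("other",     0, 0, 1, 1),   # X5
]
BASEBALL = 4  # column index of the "baseball" indicator

# P(T="sport") and P(B=1 | T="sport") as relative frequencies in the sample.
sport_docs = [d for d in docs if d[0] == "sport"]
p_sport = len(sport_docs) / len(docs)                                             # 2/5
p_baseball_given_sport = sum(d[BASEBALL] for d in sport_docs) / len(sport_docs)   # 2/2

# Bayes' rule: P(T="sport" | B=1) = P(B=1 | T="sport") P(T="sport") / P(B=1)
p_baseball = sum(d[BASEBALL] for d in docs) / len(docs)                           # 3/5
p_sport_given_baseball = p_baseball_given_sport * p_sport / p_baseball
print(p_sport_given_baseball)                                                     # ≈ 0.667
```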

Getting to Statistics ...
- P(B=1 | T="sport") = ? (parameter estimation)
- If we see the results of a huge number of random experiments, we can use the relative frequency: P(B=1 | T="sport") ≈ count(B=1, T="sport") / count(T="sport")
- But what if we only see a small sample (e.g., 2)? Is this estimate still reliable?
- In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
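A sketch of the relative-frequency estimate and of why a tiny sample is unreliable; the "true" bias of 0.7 and the simulation are purely illustrative.

```python
# Estimating a probability as a relative frequency from n observations.
import random

random.seed(0)
true_p = 0.7   # assumed "true" probability, unknown to us in practice

def frequency_estimate(n):
    """Observe n Bernoulli(true_p) outcomes and return count(1) / n."""
    draws = [1 if random.random() < true_p else 0 for _ in range(n)]
    return sum(draws) / n

print(frequency_estimate(2))        # with n=2 this can only be 0.0, 0.5, or 1.0
print(frequency_estimate(100000))   # close to 0.7 (law of large numbers)
```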

Parameter Estimation
- General setting: we are given a (hypothesized & probabilistic) model that governs the random experiment
- The model gives a probability of any data, p(D|θ), that depends on the parameter θ
- Now, given the actual sample data X = {x1, …, xn}, what can we say about the value of θ?
- Intuitively, take your best guess of θ; "best" means "best explaining/fitting the data"
- Generally an optimization problem
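A sketch of "estimation as optimization" for the simplest possible model: an i.i.d. Bernoulli sample whose log-likelihood is maximized by a crude grid search. The data and grid resolution are assumptions; for this model the closed-form answer is simply the sample frequency.

```python
# Pick the theta that best explains the data, by maximizing log p(D | theta).
import math

X = [1, 1, 0, 1, 0, 1, 1, 1]   # hypothetical observed data (1 = head)

def log_likelihood(theta, data):
    """log p(D | theta) for i.i.d. Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]          # theta candidates in (0, 1)
theta_hat = max(grid, key=lambda t: log_likelihood(t, X))
print(theta_hat)                                   # 0.75 = 6/8, the relative frequency
```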

Maximum Likelihood vs. Bayesian
- Maximum likelihood estimation: "best" means "the data likelihood reaches its maximum", i.e., θ_ML = argmax_θ p(X|θ). Problem: small sample
- Bayesian estimation: "best" means being consistent with our "prior" knowledge and explaining the data well. Problem: how to define the prior?
- Maximum a Posteriori (MAP) estimate: θ_MAP = argmax_θ p(θ|X) = argmax_θ p(X|θ) p(θ)
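A sketch contrasting the two estimators on a two-flip coin sample, using a Beta prior for the MAP estimate; the prior pseudo-counts are my own choice, the lecture only names the estimators.

```python
# ML vs. MAP for a Bernoulli parameter theta = P(head), with a Beta(a, b) prior.
heads, tails = 2, 0            # a small sample, exactly where ML is most fragile
a, b = 5, 5                    # Beta prior pseudo-counts encoding "roughly fair"

theta_ml  = heads / (heads + tails)                        # argmax p(X | theta) = 1.0
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode under the Beta prior

print(theta_ml)    # 1.0  -- overfits the 2-flip sample
print(theta_map)   # 0.6  -- pulled toward the prior's 0.5
```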

Illustration of Bayesian Estimation
- Bayesian inference: f(θ) = ?
- Posterior: p(θ|X) ∝ p(X|θ) p(θ), with likelihood p(X|θ) for data X = (x1, …, xN) and prior p(θ)
- [Figure: prior, likelihood, and posterior curves over θ, marking θ0 (prior mode), θ1 (posterior mode), θml (ML estimate), and the posterior mean]
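A sketch that computes the quantities labeled in the figure for a Beta-Bernoulli example (prior mode θ0, ML estimate θml, posterior mode θ1, posterior mean); the particular prior and data are assumptions, chosen so the posterior mode falls between the prior mode and the ML estimate, as in the illustration.

```python
# Beta(a, b) prior + Bernoulli data -> Beta(a + heads, b + tails) posterior.
a, b = 3, 7                    # assumed prior (prior mode at 0.25)
heads, n = 8, 10               # assumed data X = (x1, ..., xN)

theta_0   = (a - 1) / (a + b - 2)                     # prior mode
theta_ml  = heads / n                                 # ML estimate from X alone
theta_1   = (a + heads - 1) / (a + b + n - 2)         # posterior mode
post_mean = (a + heads) / (a + b + n)                 # posterior mean

print(theta_0, theta_ml, theta_1, post_mean)          # 0.25  0.8  ≈0.556  0.55
```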

Maximum Likelihood Estimate
- Data: a document d with counts c(w1), …, c(wN) and length |d|
- Model: a multinomial distribution M with parameters {p(wi)}
- Likelihood: p(d|M)
- Maximum likelihood estimator: M = argmax_M p(d|M)
- We tune p(wi) to maximize l(d|M), using the Lagrange multiplier approach and setting the partial derivatives to zero
- ML estimate: p(wi) = c(wi) / |d|
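A sketch of the resulting estimator p(wi) = c(wi)/|d| applied to a tiny hypothetical document; it only implements the final formula, not the Lagrange-multiplier derivation itself.

```python
# Multinomial (unigram) ML estimate for one document: p(w_i) = c(w_i) / |d|.
from collections import Counter

doc = "the game of baseball the baseball game".split()   # hypothetical document d

counts = Counter(doc)                  # c(w_1), ..., c(w_N)
length = len(doc)                      # |d| = total number of tokens

p_ml = {w: c / length for w, c in counts.items()}        # ML unigram probabilities
print(p_ml)                            # e.g. p("the") = 2/7, p("baseball") = 2/7
assert abs(sum(p_ml.values()) - 1.0) < 1e-12             # probabilities sum to 1
```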

What You Should Know
- Probability concepts: sample space, event, random variable, conditional probability, multinomial distribution, etc.
- Bayes' formula and its interpretation
- Statistics: know how to compute the maximum likelihood estimate