CS590I: Information Retrieval

CS590I: Information Retrieval
Introduction to Probability and Statistics
Luo Si, Department of Computer Science, Purdue University

Probability and Statistics
This is a world full of uncertainty: flip a coin and try to predict the outcome. We analyze the uncertainty that comes from incomplete knowledge.
Deductive vs. inferential analysis:
Probability: given a model of a random process/experiment, analyze what data can be generated.
Statistics: given some observed data, infer what model generated the data; draw conclusions about the whole population (data in the future).

Probability and Statistics: Outline
Basic concepts of probability
Conditional probability and independence
Common probability distributions
Bayes' rule
Statistical inference
Statistical learning: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation
Introduction to optimization

Basic Concepts in Probability
Experiment: flip a coin twice.
Sample space: all possible outcomes of the experiment, S = {HH, HT, TH, TT}.
Event: a subset of the possible outcomes (of the whole space), e.g., A = {TT} (all tails); B = {HT, TH} (one head and one tail).
Probability of an event: a number that indicates how probable the event is.
Axiom 1 (Nonnegativity): Pr(A) ≥ 0 for all A ⊆ S.
Axiom 2 (Normalization): Pr(S) = 1.
Axiom 3 (Additivity): for every sequence of disjoint events A1, A2, ..., Pr(A1 ∪ A2 ∪ ...) = Pr(A1) + Pr(A2) + ...
Example (equally likely outcomes): Pr(A) = n(A)/N, where n(A) is the size of A and N is the size of S.
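A minimal Python sketch of the counting rule Pr(A) = n(A)/N on the two-flip sample space above (the itertools-based enumeration and the variable names are my own illustration):

```python
from itertools import product

# Sample space for flipping a coin twice: S = {HH, HT, TH, TT}
S = [''.join(flips) for flips in product('HT', repeat=2)]

A = {'TT'}            # event: all tails
B = {'HT', 'TH'}      # event: one head and one tail

# Equally likely outcomes: Pr(A) = n(A) / N
print(len(A) / len(S))   # 0.25
print(len(B) / len(S))   # 0.5
```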

Conditional Probability
If A and B are events with Pr(A) > 0, the conditional probability of B given A is Pr(B | A) = Pr(A, B) / Pr(A): the probability that B happens given that we already know A has happened.
Example: calculate the probabilities Pr(Admitted | Dept1), Pr(Admitted | Dept2), Pr(Admitted | Dept1, Female), Pr(Admitted | Dept1, Male).

Independence
Two events A and B are independent iff Pr(A, B) = Pr(A)Pr(B): the probability that both A and B happen equals the probability that A happens times the probability that B happens; the two events have no influence on each other.
Example: is admission independent of gender?
Pr(admitted, male) = 60/800 = 7.5%
Pr(admitted) * Pr(male) = 110/800 * 500/800 ≈ 8.6%
Not independent.

Independence
Two events A and B are independent iff Pr(A, B) = Pr(A)Pr(B); equivalently, Pr(A | B) = Pr(A).
Example (not independent):
Pr(admitted | male) = 60/500 = 12%
Pr(admitted) = 110/800 = 13.75%
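A small Python sketch that checks both forms of the independence test on the admission counts quoted above (only the raw counts 60, 110, 500 and 800 come from the slides; the variable names are mine):

```python
# Counts from the admission example
n_total         = 800
n_admitted      = 110
n_male          = 500
n_admitted_male = 60

p_admitted_and_male = n_admitted_male / n_total   # Pr(admitted, male) = 0.075
p_admitted          = n_admitted / n_total        # Pr(admitted)       = 0.1375
p_male              = n_male / n_total            # Pr(male)           = 0.625

# Test 1: Pr(A, B) == Pr(A) * Pr(B)?
print(p_admitted_and_male, p_admitted * p_male)   # 0.075 vs ~0.0859 -> not independent

# Test 2 (equivalent): Pr(A | B) == Pr(A)?
p_admitted_given_male = n_admitted_male / n_male  # Pr(admitted | male) = 0.12
print(p_admitted_given_male, p_admitted)          # 0.12 vs 0.1375 -> not independent
```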

Conditional Independence
Events A and B are conditionally independent given C iff Pr(A, B | C) = Pr(A | C) Pr(B | C): if we know the outcome of event C, then the outcomes of events A and B are independent.
Example (conditionally independent):
Pr(Male, Admitted | Dept1) = 40/500 = 8%
Pr(Admitted | Dept1) * Pr(Male | Dept1) = 50/500 * 400/500 = 8%

Common Probability Distributions
Different types of probability distributions associate uncertain outcomes with different physical phenomena:
Flip a coin: Bernoulli/Binomial
Roll a die (or write a document with several words): Multinomial
Randomly select a point close to a specific point: Gaussian
A probability mass/density function defines how probable it is that the random outcome X (e.g., the side of a coin) equals a specific data point x (e.g., head or tail): P(X = x) for x in S.

Common Probability Distributions
Some properties of probability mass/density distributions:
Expectation: the average value of the outcomes. Example: the average outcome of a die is 1/6*1 + 1/6*2 + 1/6*3 + 1/6*4 + 1/6*5 + 1/6*6 = 21/6 = 3.5.
Variance: how diverse the outcomes are (deviation from the expectation). Example: the variance of a coin flip (1 for head, 0 for tail) is 1/2*(0 - 1/2)^2 + 1/2*(1 - 1/2)^2 = 1/4.
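A minimal Python sketch of the two computations above, assuming a fair die and a fair coin as in the slide:

```python
# Fair six-sided die: outcomes 1..6, each with probability 1/6
die_outcomes = [1, 2, 3, 4, 5, 6]
die_probs    = [1/6] * 6
expectation  = sum(p * x for p, x in zip(die_probs, die_outcomes))
print(expectation)   # 3.5

# Fair coin: 1 for head, 0 for tail, each with probability 1/2
coin_outcomes = [0, 1]
coin_probs    = [0.5, 0.5]
coin_mean = sum(p * x for p, x in zip(coin_probs, coin_outcomes))            # 0.5
coin_var  = sum(p * (x - coin_mean) ** 2 for p, x in zip(coin_probs, coin_outcomes))
print(coin_var)      # 0.25
```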

Common Probability Distributions
Bernoulli/Binomial: model binary outcomes, e.g., the side of a coin, whether a term appears in a document, whether an email is spam.
Bernoulli: a single binary outcome (i.e., 0 or 1) that is 1 with probability p. Expectation: p. Variance: p(1 - p).
Binomial: n outcomes of a binary variable, each 1 with probability p; what is the probability that outcome 1 appears x times? Expectation: np. Variance: np(1 - p).
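A short Python sketch of the binomial probability mass function and its moments; the values n = 10, p = 0.3 and x = 3 are illustrative choices of mine, not from the slides:

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability that outcome 1 appears x times in n Bernoulli(p) trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3                    # illustration values
print(binomial_pmf(3, n, p))      # P(X = 3) ~= 0.2668
print(n * p, n * p * (1 - p))     # expectation np = 3.0, variance np(1-p) = 2.1
```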

Common Probability Distributions
Multinomial: model multiple outcomes, e.g., the side of a die, the topic of a document, the occurrences of terms within a document.
Multinomial: n outcomes of a variable with multiple values (v1, ..., vk), taking value v1 with probability p1, ..., value vk with probability pk; what is the probability that v1 appears x1 times, ..., vk appears xk times?
Expectation: E(Xi) = n*pi. Variance: Var(Xi) = n*pi*(1 - pi).

Common Probability Distributions
Multinomial example: three words in the vocabulary (sport, basketball, finance); a multinomial model generates the words with probabilities (ps=0.5, pb=0.4, pf=0.1), indexed by the first character of each word. A document generated by this model contains 10 words.
What is the expected number of occurrences of the word "sport"? 10*0.5 = 5.
What is the probability of generating 5 "sport", 3 "basketball" and 2 "finance"?
Does the word order matter here? No: this is the bag-of-words representation.
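A sketch that answers the probability question above with the multinomial pmf (since word order does not matter, the multinomial coefficient accounts for all orderings):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts under a multinomial model."""
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)
    prob = coeff
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

# 10-word document, probabilities (sport, basketball, finance) = (0.5, 0.4, 0.1)
print(multinomial_pmf([5, 3, 2], [0.5, 0.4, 0.1]))   # ~0.0504
```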

Common Probability Distributions
Gaussian: model a continuous distribution; draw data points close to a specific point.
Gaussian (Normal) distribution: select data points close (as measured by the variance σ²) to a specific point μ.
Expectation: E(Xi) = μ. Variance: Var(Xi) = σ².
μ and σ can be vectors (a mean vector and a covariance matrix): multivariate Gaussian.

Common Probability Distributions
Gaussian example: a Gaussian (Normal) distribution with μ = [0, 0] and Σ = [1 0; 0 1]; the figure shows 100 data points '+' randomly generated by the model.
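A sketch that reproduces the figure's setup, drawing 100 points from a 2-D Gaussian with μ = [0, 0] and identity covariance (numpy is assumed to be available; the seed is arbitrary):

```python
import numpy as np

mu    = np.array([0.0, 0.0])            # mean vector
sigma = np.array([[1.0, 0.0],           # covariance matrix (identity)
                  [0.0, 1.0]])

rng = np.random.default_rng(0)
points = rng.multivariate_normal(mu, sigma, size=100)   # 100 '+' points as in the figure

print(points.shape)            # (100, 2)
print(points.mean(axis=0))     # sample mean, close to [0, 0]
```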

Bayes’s Rule Beyes’ Rule Suppose that B1, B2, … Bn form a partition of sample space S: Assume Pr(A) > 0. Then Reverse of Conditional Probability Definition Additivity Definition of Conditional Probability Normalization term

Bayes' Rule
Interpretation of Bayes' rule: hypothesis space H = {H1, ..., Hn}, observed data D.
Pr(Hi | D) = Pr(D | Hi) Pr(Hi) / Pr(D)
Pr(Hi | D) is the posterior probability of Hi; Pr(Hi) is the prior probability of Hi; Pr(D | Hi) is the likelihood of the data if Hi is true; Pr(D) is constant with respect to the hypothesis.
To pick the most likely hypothesis H*, Pr(D) can be dropped: H* = argmax over Hi of Pr(D | Hi) Pr(Hi).

Common Probability Distributions
Multinomial example: five words in the vocabulary (sport, basketball, ticket, finance, stock) and two topics:
Sport: (psp=0.4, pb=0.25, pt=0.25, pf=0.1, pst=0)
Business: (psp=0.1, pb=0.1, pt=0.1, pf=0.3, pst=0.4)
Prior probabilities: Pr(Sport) = 0.5, Pr(Business) = 0.5.
Given a document d = (sport, basketball, ticket, finance), what are Pr(d | topic), Pr(Sport | d) and Pr(Business | d)?
If we instead know Pr(Sport) = 0.1 and Pr(Business) = 0.9, what about Pr(Sport | d) and Pr(Business | d)?
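A sketch that answers the two questions above by combining the multinomial likelihood of the document with each prior via Bayes' rule (the multinomial coefficient cancels in the normalization, so the word probabilities are simply multiplied):

```python
# Word-generation probabilities for each topic
topic_probs = {
    'Sport':    {'sport': 0.4, 'basketball': 0.25, 'ticket': 0.25, 'finance': 0.1, 'stock': 0.0},
    'Business': {'sport': 0.1, 'basketball': 0.1,  'ticket': 0.1,  'finance': 0.3, 'stock': 0.4},
}

doc = ['sport', 'basketball', 'ticket', 'finance']

def posterior(priors):
    """Pr(topic | doc) via Bayes' rule; the multinomial coefficient cancels."""
    joint = {}
    for topic, prior in priors.items():
        likelihood = 1.0
        for word in doc:
            likelihood *= topic_probs[topic][word]
        joint[topic] = prior * likelihood
    total = sum(joint.values())
    return {topic: value / total for topic, value in joint.items()}

print(posterior({'Sport': 0.5, 'Business': 0.5}))   # Sport ~0.893, Business ~0.107
print(posterior({'Sport': 0.1, 'Business': 0.9}))   # Sport ~0.481, Business ~0.519
```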

Probability and Statistics: Outline
Basic concepts of probability
Conditional probability and independence
Common probability distributions
Bayes' rule
Statistical inference
Statistical learning: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation
Introduction to optimization

Statistical Inference
Example: five words in the vocabulary (sport, basketball, ticket, finance, stock); two topics, "Sport" and "Business", each with a set of documents. How can we estimate the multinomial distribution for the two topics, e.g., Pr("sport" | Business), Pr("stock" | Sport), ...?
Probability theory: Model → Data. Statistical inference: Data → Model/Parameters, especially with a small amount of observed data.
In general, statistics is about drawing conclusions on a whole population based on observations of a sample (the data).

Parameter Estimation
Given a probabilistic model that generates the data in an experiment, the model gives the probability of any data, p(D | θ), which depends on the parameter θ.
We observe some sample data X = {x1, ..., xn}; what can we say about the value of θ?
Intuitively, take your best guess of θ, where "best" means best explaining/fitting the data.
This is generally an optimization problem.

Parameter Estimation
Example: given a document topic model, which is a multinomial distribution over five vocabulary words (sport, basketball, ticket, finance, stock), we observe two documents:
d1: (sport basketball ticket)
d2: (sport basketball sport)
Estimate the parameters of the multinomial distribution (psp, pb, pt, pf, pst).

Maximum Likelihood Estimation (MLE)
Find the model parameters that make the generation likelihood reach its maximum: M* = argmaxM Pr(D | M).
There are K words in the vocabulary, w1, ..., wK (e.g., K = 5).
Data: documents di, each with counts ci(w1), ..., ci(wK) and length |di|.
Model: multinomial M with parameters {p(wk)}.
Likelihood: Pr(di | M) ∝ Πk p(wk)^ci(wk).
M* = argmaxM Πi Pr(di | M), subject to Σk p(wk) = 1.

Maximum Likelihood Estimation (MLE)
Use the Lagrange multiplier approach: maximize Σi Σk ci(wk) log p(wk) + λ (Σk p(wk) - 1).
Set the partial derivatives to zero: Σi ci(wk) / p(wk) + λ = 0 for each wk.
Solving together with the constraint Σk p(wk) = 1 gives the maximum likelihood estimate p(wk) = Σi ci(wk) / Σi |di|.

Maximum Likelihood Estimation (MLE)
Example: given a document topic model (a multinomial distribution over the five vocabulary words sport, basketball, ticket, finance, stock), we observe two documents:
d1: (sport basketball ticket)
d2: (sport basketball sport)
The maximum likelihood parameters of the multinomial distribution are (psp, pb, pt, pf, pst) = (3/6, 2/6, 1/6, 0/6, 0/6), so (psp=0.5, pb≈0.33, pt≈0.17, pf=0, pst=0).
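A minimal sketch of the MLE formula p(wk) = c(wk)/|d| applied to the two observed documents:

```python
from collections import Counter

vocab = ['sport', 'basketball', 'ticket', 'finance', 'stock']
docs  = [['sport', 'basketball', 'ticket'],
         ['sport', 'basketball', 'sport']]

# Pool the word counts over the observed documents
counts = Counter(word for doc in docs for word in doc)
total  = sum(counts.values())                     # total length, 6

mle = {w: counts[w] / total for w in vocab}
print(mle)   # sport: 0.5, basketball: ~0.33, ticket: ~0.17, finance: 0.0, stock: 0.0
```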

Maximum A Posteriori (MAP) Estimation
Maximum likelihood estimation gives zero probabilities with small samples (e.g., 0 for "finance") and is purely data driven: it cannot incorporate prior belief/knowledge.
Maximum a posteriori estimation selects the model that maximizes the probability of the model given the observed data: M* = argmaxM Pr(M | D) = argmaxM Pr(D | M) Pr(M).
Pr(M) is the prior belief/knowledge; using the prior Pr(M) avoids zero probabilities.

Maximum A Posteriori (MAP) Estimation
There are K words in the vocabulary, w1, ..., wK (e.g., K = 5).
Data: documents di, each with counts ci(w1), ..., ci(wK) and length |di|.
Model: multinomial M with parameters {p(wk)}.
Posterior: Pr(M | di); M* = argmaxM Pr(M | di) = argmaxM Pr(di | M) Pr(M).
The prior Pr(M) = Pr(p1, ..., pK) is a Dirichlet prior: Dir(p | α1, ..., αK) ∝ Πk p(wk)^(αk - 1), where the αk are hyper-parameters.

Maximum A Posteriori (MAP) Estimation
The Dirichlet prior is the conjugate prior for the multinomial distribution.
For the topic model estimation example, the MAP estimator is p(wk) = (c(wk) + αk - 1) / (|d| + Σj (αj - 1)), where the αk - 1 act as pseudo counts.
d1: (sport basketball ticket)
d2: (sport basketball sport)
With one pseudo count per word, the maximum a posteriori parameters of the multinomial distribution are (psp, pb, pt, pf, pst) = ((3+1)/(6+5), (2+1)/(6+5), (1+1)/(6+5), 1/(6+5), 1/(6+5)), so (psp≈0.364, pb≈0.273, pt≈0.182, pf≈0.091, pst≈0.091).
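A sketch of the MAP estimate with the Dirichlet pseudo counts used above (one pseudo count per vocabulary word, i.e., all αk = 2):

```python
from collections import Counter

vocab = ['sport', 'basketball', 'ticket', 'finance', 'stock']
docs  = [['sport', 'basketball', 'ticket'],
         ['sport', 'basketball', 'sport']]

counts = Counter(word for doc in docs for word in doc)
total  = sum(counts.values())                     # 6

pseudo = 1                                        # alpha_k - 1 = 1 for every word
map_est = {w: (counts[w] + pseudo) / (total + pseudo * len(vocab)) for w in vocab}
print(map_est)   # sport: ~0.364, basketball: ~0.273, ticket: ~0.182, finance/stock: ~0.091
```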

Introduction to Optimization
Optimization is the mathematical discipline concerned with finding the maxima and minima of functions, possibly subject to constraints.
Example we have already seen: maximize Σk c(wk) log p(wk) subject to Σk p(wk) = 1.

Introduction to Optimization
Calculating an analytic solution:
Calculate the first derivative (with a Lagrange multiplier when subject to constraints).
Set the resulting equation to 0 and try to solve for the solution.
Check whether the second derivative is positive (minimum) or negative (maximum).
Example: for f(x) = x^2, f'(x) = 2x = 0 at x = 0 and f''(x) = 2 > 0, so x = 0 is a minimum.

Introduction to Optimization
Approximate solution with an iterative method: many equations obtained by setting derivatives to zero have no analytic solution; an iterative method refines the solution step by step.
Newton's method uses first-derivative and second-derivative information to refine the solution:
x_new = x_old - f'(x_old) / f''(x_old)
where x_new is the new updated solution and x_old is the old solution.
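A sketch of the Newton update x_new = x_old - f'(x_old)/f''(x_old) on a simple one-dimensional function; f(x) = x^4 - 3x is my own illustration, not the slide's example:

```python
def f_prime(x):
    return 4 * x ** 3 - 3      # f'(x) for f(x) = x**4 - 3*x

def f_double_prime(x):
    return 12 * x ** 2         # f''(x)

x = 1.0                        # initial guess
for step in range(10):
    x = x - f_prime(x) / f_double_prime(x)   # Newton update
print(x)                       # ~0.9086, the minimizer of f (f'' > 0 there)
```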

Introduction to Optimization
Newton's method does not guarantee that the new solution improves on the old one.
The Expectation-Maximization (EM) method is a lower-bound method: it always makes an improvement, is more elegant, and often has a good probabilistic interpretation.

Introduction to Optimization
Expectation-Maximization example: given two biased dice A and B with known (PA(1), ..., PA(6)) and (PB(1), ..., PB(6)). Each time, we draw A with probability λ and B with probability 1 - λ. We observe a sequence X = {x1, ..., xn} and want to estimate the most likely λ.

Introduction to Optimization
Previous solution: maximize the log-likelihood log L(λ) = Σi log( λ PA(xi) + (1 - λ) PB(xi) ).
Set d log L(λ) / dλ = Σi ( PA(xi) - PB(xi) ) / ( λ PA(xi) + (1 - λ) PB(xi) ) = 0; this equation has no analytic solution for λ.

Introduction to Optimization
Previous solution: introduce a weight qi for each observation xi. By the concavity of the logarithm function (Jensen's inequality),
log( λ PA(xi) + (1 - λ) PB(xi) ) ≥ qi log( λ PA(xi) / qi ) + (1 - qi) log( (1 - λ) PB(xi) / (1 - qi) ).

Introduction to Optimization
Set qi = λ_old PA(xi) / ( λ_old PA(xi) + (1 - λ_old) PB(xi) ), the posterior probability that xi came from die A under the current estimate λ_old. By the concavity of the logarithm (Jensen's inequality), the resulting lower bound on the log-likelihood is tight at λ = λ_old.

Introduction to Optimization
Current solution: maximize the derived lower bound, which gives the update λ_new = (1/n) Σi qi. Alternately recomputing qi (E-step) and updating λ (M-step) never decreases the log-likelihood.
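A sketch of this EM procedure for the two-dice example as reconstructed above; the dice distributions PA and PB, the true mixing weight and the simulated sequence are illustrative values of my own:

```python
import random

# Known (biased) die distributions over faces 1..6
PA = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
PB = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

# Simulate an observed sequence with a "true" mixing weight (illustration only)
random.seed(0)
true_lam = 0.7
X = [random.choices(range(1, 7), weights=PA if random.random() < true_lam else PB)[0]
     for _ in range(5000)]

lam = 0.5                                             # initial guess
for _ in range(50):
    # E-step: posterior probability that each observation came from die A
    q = [lam * PA[x - 1] / (lam * PA[x - 1] + (1 - lam) * PB[x - 1]) for x in X]
    # M-step: maximize the lower bound -> average responsibility
    lam = sum(q) / len(q)
print(lam)                                            # close to 0.7
```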

Probability and Statistics: Outline
Basic concepts of probability
Conditional probability and independence
Common probability distributions
Bayes' rule
Statistical inference
Statistical learning: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation
Introduction to optimization

Probability and Statistics: References
Section 2.1 of Foundations of Statistical Natural Language Processing: http://cognet.mit.edu/library/books/mitpress/0262133601/cache/chap2.pdf (pp. 39-59)
Online notes on probability and statistics for computer science: http://www.utdallas.edu/~mbaron/3341/Fall06/ (Chaps 2, 3, 4, 12)
Probability and Statistics, M.H. DeGroot and M.J. Schervish, 2001, Addison-Wesley.
Optimization (online): "Convex Optimization", S. Boyd and L. Vandenberghe, http://www.stanford.edu/~boyd/cvxbook/