Corpora and Statistical Methods
Albert Gatt

Probability distributions

Example 1: Book publishing

Case: a publishing house is considering whether to publish a new textbook on statistical NLP. Considerations include production cost, expected sales, and net profits (given cost).

Problem: to publish or not to publish?
- Whether to publish depends on expected sales and profits.
- If published, how many copies? This depends on demand and cost.

Example 1: Demand & cost figures

Suppose the book costs €35, of which:
- the publisher gets €25
- the bookstore gets €6
- the author gets €4

To make a decision, the publisher needs to estimate profits as a function of the probability of selling n books, for different values of n:

profit = (€25 * n) – overall production cost

Terminology

Random variable: in this example, the expected profit from selling n books is our random variable. It takes on different values, depending on n. We use uppercase (e.g. X) to denote the random variable.

Distribution: the different values of X (denoted x) form a distribution. If each value x can be assigned a probability (the probability of making a given profit), then we can plot each value x against its likelihood.

Definitions

Random variable: a variable whose numerical value is determined by chance. Formally, a function that returns a unique numerical value determined by the outcome of an uncertain situation. Can be discrete (our exclusive focus) or continuous.

Probability distribution: for a discrete random variable X, the probability distribution p(x) gives the probabilities for each value x of X. The probabilities p(x) of all possible values of X sum to 1. The distribution tells us how much of the overall probability space (the "probability mass") each value x takes up.

Tabulated probability distribution

No. copies sold | Prod. cost | Profits (X) | Probability P(x)
5,000           | €275,000   | -€150,000   | .20
10,000          | €300,000   | -€50,000    | .40
20,000          | €350,000   | €150,000    | .25
30,000          | €400,000   | €350,000    | .10
40,000          | €450,000   | €550,000    | .05

(Currency is given in euros throughout, consistent with the demand and cost figures above; the profit for 30,000 copies follows from 25 × 30,000 – 400,000.)

Plotting the distribution

Uses of a probability distribution

Computation of:
- mean: the expected value of X in the long run, based on the specific values of X and their probabilities. NB: not interpreted as a value in a sample of data, but as the expected (future) value based on the sample.
- standard deviation & variance: the extent to which actual values of X will differ from the mean.
- skewness: the extent to which our distribution is "balanced", i.e. whether it is symmetrical.

In graphics…

- Mean: expected value in the long run.
- SD & variance: how much actual values deviate from the mean overall.
- Skewness: symmetry or "tail" of our distribution.

Measures of expectation and variation

The expected value (mean)

The expected value of a discrete random variable X, denoted E[X] or μ, is a weighted average of the values of X:

E[X] = μ = Σ x p(x), summed over all values x of X

- weighted, because not all values x have the same probability
- estimated by summing, for all values of X, the product of x and its probability p(x)

More on expected value

The mean or expected value tells us that, in the long run, we can expect X to have the value μ. E.g. in our example, our book publisher can expect long-term profits of:

(-150,000 * .2) + (-50,000 * .4) + (150,000 * .25) + (350,000 * .1) + (550,000 * .05) = €50,000
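To make the arithmetic concrete, here is a minimal Python sketch of the same computation. The `distribution` list just transcribes the table above, and `expected_value` is our own helper name:

```python
# Profit distribution from the table above:
# each pair is (profit x in euros, probability p(x)).
distribution = [
    (-150_000, 0.20),
    (-50_000, 0.40),
    (150_000, 0.25),
    (350_000, 0.10),
    (550_000, 0.05),
]

def expected_value(dist):
    """E[X] = sum over all values x of x * p(x)."""
    return sum(x * p for x, p in dist)

print(expected_value(distribution))  # ≈ 50,000, matching the slide
```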

Variance

The mean is the expected value of X, E[X]. Variance (σ²) reflects the extent to which the actual outcomes deviate from expectation (i.e. from E[X]):

σ² = E[(X – μ)²] = Σ (x – μ)² p(x)

i.e. the weighted sum of squared deviations. Reasons for squaring:
- it eliminates the distinction between positive and negative deviations
- it weights large deviations more heavily: e.g. one deviation of 10 counts as much as four deviations of 5

Standard deviation

Variance gives overall dispersion or variation. Standard deviation (σ) is the dispersion of possible outcomes; it indicates how spread out the distribution is. It is estimated as the square root of the variance: σ = √σ².

The book publishing example again

Recall that for our new book on stat NLP, the expected profit is €50,000. What is the standard deviation? We need to:
- estimate (x – 50,000)² for each value x
- multiply each by p(x) and sum
- take the square root of the result

This is left as an exercise…
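A sketch of how the computation could be scripted, reusing `distribution` and `expected_value` from the earlier sketch (`variance` and `std_dev` are our own helper names):

```python
import math

def variance(dist, mu):
    """sigma^2 = sum over all x of (x - mu)^2 * p(x)."""
    return sum((x - mu) ** 2 * p for x, p in dist)

def std_dev(dist, mu):
    """sigma = square root of the variance."""
    return math.sqrt(variance(dist, mu))

mu = expected_value(distribution)  # ≈ 50,000, from the earlier sketch
print(std_dev(distribution, mu))
```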

Skewness The mean gives us the “centre” of a distribution. Standard deviation gives us dispersion. Skewness (denoted γ “gamma”) is a measure of the symmetry of the outcomes.

Skewness, continued

The skewness formula divides the average cubed deviation by the standard deviation cubed:

γ = E[(X – μ)³] / σ³ = Σ (x – μ)³ p(x) / σ³

Why cubed? The cube of a positive deviation is itself positive; that of a negative deviation is itself negative. We want both, since we want to know about deviations both to the left (negative) and right (positive) of the mean. Like the variance estimate, this emphasises large deviations in either direction. If the outcomes are symmetrical around the mean, then positive and negative deviations balance out, and skewness is 0.

Graphical display of skewness

Positive skewness: tail going right. Negative skewness: tail going left.

Skewness and language

By Zipf's law (next week), word frequencies do not cluster around the mean: there are a few highly frequent words (making up a large proportion of overall word frequency), and many highly infrequent words (f = 1 or f = 2). So the Zipfian distribution is highly skewed. We will hear more about the Zipfian distribution in the next lecture.

The concept of information

What is information?

Main ingredient: an information source, which "transmits" symbols from a finite alphabet S.
- Every symbol is denoted si.
- We call a sequence of such symbols a text.
- Assume a probability distribution such that every si has probability p(si).

Example: a die is an information source; every throw yields a symbol from the alphabet {1,2,3,4,5,6}. Six successive throws yield a text of 6 symbols.

Quantifying information

Intuition: the more probable a symbol is, the less information it yields ("something seen very often is not very surprising"). So information is based on the inverse probability of the symbol:

I(s) = log_b (1 / p(s)) = –log_b p(s), for some base b > 1

Usually we use base 2. Another term for I(s) is surprisal.
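A minimal sketch of surprisal in Python (the `surprisal` helper is our own name):

```python
import math

def surprisal(p, base=2):
    """I(s) = log_b(1 / p(s)): the rarer the symbol, the higher the surprisal."""
    return math.log(1 / p, base)

print(surprisal(0.5))    # 1.0 bit: one fair coin flip
print(surprisal(1.0))    # 0.0 bits: a certain event carries no information
print(surprisal(1 / 8))  # 3.0 bits: one face of a fair 8-sided die
```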

Properties of I

- Non-negative: I(s) ≥ 0.
- If p(s) = 1, then I(s) = 0.
- If two events s1, s2 are independent, then I(s1 s2) = I(s1) + I(s2).
- Monotonic: slight changes in probability result in slight changes in I.

Aggregate measure of information

What is the information content of a text (sequence of symbols)? This is the same as finding the average information of a random variable. The measure is called entropy, denoted H.
- Define X as a random variable over the symbols in our alphabet.
- P(s) = P(X=s) for all s in our alphabet.
- Estimate H(P).

Entropy

The entropy (or information) of a probability distribution P is:

H(P) = –Σ p(s) log₂ p(s)

Entropy is the expected value (mean) of the surprisal. The value is interpreted as a number of "bits" of information.

Entropy example

Source = an 8-sided die, with alphabet S = {1,2,3,4,5,6,7,8}, where every si has p = 1/8:

H(P) = –Σ (1/8) log₂ (1/8) = –log₂ (1/8) = 3 bits
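The same calculation in Python (the `entropy` helper is our own name):

```python
import math

def entropy(probs, base=2):
    """H(P) = -sum of p * log_b(p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1 / 8] * 8))  # 3.0 bits for the fair 8-sided die

# A biased source is more predictable, so its entropy is lower:
print(entropy([0.9, 0.1]))   # ≈ 0.47 bits
```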

Interpretation of entropy

- The information contained in the distribution P: the more unpredictable the outcomes, the higher the entropy.
- The message length, if the message was generated according to P and coded optimally.

Interpretation cont/d

For the 8-sided die example, the result H(P) = 3 tells us we need 3 bits on average to "transmit" the result of rolling an 8-sided die; we can't do it in less than 3 bits:

Outcome: 1    2    3    4    5    6    7    8
Code:    001  010  011  100  101  110  111  000

Entropy for multiple variables

So far we have dealt with a single random variable. The joint entropy of a pair of random variables X and Y is:

H(X,Y) = –Σx Σy p(x,y) log₂ p(x,y)

Conditional Entropy

Given X and Y, how much information about Y do we gain if we know X? We use a version of entropy based on conditional probability:

H(Y|X) = Σx p(x) H(Y|X=x) = –Σx Σy p(x,y) log₂ p(y|x)
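A sketch of both quantities over a toy joint distribution; the distribution, its numbers, and the helper names are all invented for illustration:

```python
import math

# Invented toy joint distribution p(x, y) over X in {a, b} and Y in {0, 1};
# the probabilities sum to 1.
joint = {
    ("a", 0): 0.4, ("a", 1): 0.1,
    ("b", 0): 0.2, ("b", 1): 0.3,
}

def joint_entropy(joint):
    """H(X,Y) = -sum over x,y of p(x,y) * log2 p(x,y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def conditional_entropy(joint):
    """H(Y|X) = -sum over x,y of p(x,y) * log2 p(y|x), with p(y|x) = p(x,y) / p(x)."""
    px = {}
    for (x, _), p in joint.items():  # marginal p(x)
        px[x] = px.get(x, 0.0) + p
    return -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items() if p > 0)

print(joint_entropy(joint))        # H(X,Y)
print(conditional_entropy(joint))  # H(Y|X)
```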

Mutual information

Mutual information

Just as probability can change based on posterior knowledge, so can information. Suppose our distribution gives us the probability P(a) of observing the symbol a, and suppose we first observe the symbol b. If a and b are not independent, this should alter our information state with respect to the probability of observing a: i.e. we can compute p(a|b).

Mutual info between two symbols

The change in our information about a on observing b is:

I(a;b) = log₂ [p(a|b) / p(a)] = log₂ [p(a,b) / (p(a) p(b))]

If a and b are completely independent, I(a;b) = 0.

Averaging mutual information

We want to average mutual information between all values of a random variable A and those of a random variable B:

I(A;B) = Σa Σb p(a,b) log₂ [p(a,b) / (p(a) p(b))]

Equivalently, I(A;B) = H(A) – H(A|B), and similarly I(A;B) = H(B) – H(B|A).

Combining the two…

Thus, mutual information involves taking the joint probability and dividing by the individual probabilities: i.e. a comparison of the likelihood of observing a and b together vs. separately.
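Continuing the toy example, a sketch of average mutual information, reusing `joint` and `math` from the conditional-entropy sketch above (`mutual_information` is our own name):

```python
def mutual_information(joint):
    """I(A;B) = sum over a,b of p(a,b) * log2[ p(a,b) / (p(a) * p(b)) ]."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():  # marginals p(a) and p(b)
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )

print(mutual_information(joint))  # 0 exactly when A and B are independent
```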

Mutual Information: summary

Mutual information gives a measure of the reduction in uncertainty about a random variable X, given knowledge of Y: it quantifies how much information about X is contained in Y.

Some more on I(X;Y)

In statistical NLP, we often calculate pointwise mutual information: the mutual information between two points on a distribution, I(x;y) rather than I(X;Y). This is used for some applications in lexical acquisition.

Mutual Information -- example

Suppose we're interested in the collocational strength of two words x and y, e.g. bread and butter. Mutual information quantifies the likelihood of observing x and y together (in some window). If there is no interesting relationship, knowing about bread tells us nothing about the likelihood of encountering butter; in that case, P(x,y) = P(x)P(y) and I(x;y) = 0. This is the Church and Hanks (1991) approach. NB: the approach uses pointwise MI.
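A sketch of how pointwise MI might be estimated from corpus counts; the counts below are invented for illustration, and `pmi` is our own helper name:

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise MI from corpus counts, with maximum-likelihood estimates:
    I(x;y) = log2[ p(x,y) / (p(x) * p(y)) ]."""
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))

# Invented counts for illustration: in 1,000,000 text windows, "bread"
# occurs 500 times, "butter" 400 times, and the two co-occur 80 times.
print(pmi(80, 500, 400, 1_000_000))  # ≈ 8.6 bits: a strong collocation
```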