Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.



Course book: Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze

Chapter 5: Collocations (January 29, 2007)

Collocations
Examples:
- strong tea; weapons of mass destruction; to make up; rich and powerful
Compositionality:
- The meaning of the whole is the sum of the meanings of its parts
- This does not hold for collocations, and in particular not for idioms: kick the bucket; heard it through the grapevine
Contextual theory of meaning:
- Firth: “You shall know a word by the company it keeps”

Frequency

Filtering by POS

Collocations in New York Times newswire articles

Mean and Variance
Not all collocations are bigrams. For example:
- She knocked on his door
- They knocked at the door
- A man knocked on the metal front door
The words “knocked” and “door” go together, even with other words in between. You would not say “hit”, “beat”, or “rap” instead of “knock”.

Counting collocations with intervening words

Mean & Variance
- She knocked on his door
- They knocked at the door
- A man knocked on the metal front door
- 100 women knocked on Donaldson’s door
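The four example sentences can be summarised by the mean and sample standard deviation of the offset between “knocked” and “door”. The following is only a minimal sketch: it assumes plain whitespace tokenization (so “Donaldson’s” counts as one token, which changes the offsets compared with a tokenizer that splits off the possessive), and the helper name `offsets` is hypothetical.

```python
from statistics import mean, stdev

# The four example sentences from the slide, lower-cased.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "a man knocked on the metal front door",
    "100 women knocked on donaldson's door",
]

def offsets(sentences, w1, w2):
    """Signed distance (in tokens) from the first w1 to the first w2 in each sentence."""
    result = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        if w1 in tokens and w2 in tokens:
            result.append(tokens.index(w2) - tokens.index(w1))
    return result

d = offsets(sentences, "knocked", "door")
d_mean = mean(d)   # average offset between the two words
d_std = stdev(d)   # sample standard deviation (divides by n - 1)

print(d, d_mean, d_std)
```

A small standard deviation means the two words tend to occur at a fixed distance from each other, which is the pattern a collocation is expected to show; a large one suggests no fixed relationship.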

- Strong support
- Strong leftist support
- Strong business support

Analysis of variance
A larger variance in the distance between “strong” and “for” indicates that the two words do not form an interesting collocation.

Hypothesis Testing
The fact that two words co-occur does not necessarily make them a collocation: words can appear together merely by chance.
Null hypothesis $H_0$: there is no association between the words beyond chance, i.e. $P(w_1 w_2) = P(w_1)P(w_2)$

The t test
Null hypothesis: the mean height of a population is 158 cm.
Because the t value is large, we can reject the null hypothesis that the population mean is 158 cm, at a confidence level of 99.5%.
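As a rough illustration of the height example, the t statistic compares the sample mean with the hypothesised mean, scaled by the standard error: $t = (\bar{x} - \mu) / \sqrt{s^2 / N}$. The sample figures below (200 people, mean 169 cm, variance 2600) are assumptions made for this sketch, since the transcript does not show them; the function name `t_statistic` is likewise hypothetical.

```python
import math

def t_statistic(sample_mean, sample_var, n, mu0):
    """One-sample t statistic: (sample mean - hypothesised mean) / standard error."""
    return (sample_mean - mu0) / math.sqrt(sample_var / n)

# Assumed sample (not given in the slides): 200 people, mean 169 cm, variance 2600.
# Null hypothesis from the slide: the population mean is 158 cm.
t_height = t_statistic(sample_mean=169, sample_var=2600, n=200, mu0=158)

print(round(t_height, 2))
```

With a sample this large, a one-sided test at the 0.005 level (99.5% confidence) has a critical value of about 2.576, so under these assumed figures the null hypothesis would be rejected.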

Applying t test to corpora
Applying the t test to the candidate collocation “new companies”:
- The goal is to find out whether the occurrence of “new companies” can be explained by chance.
- The text corpus is viewed as a long sequence of bigrams, i.e. overlapping two-word pairs.

Applying t test to text corpora
- The word “new” occurs 15,828 times
- The word “companies” occurs 4,675 times
- There are 14,307,668 tokens (words)
- P(new) = 15,828/14,307,668
- P(companies) = 4,675/14,307,668
Null hypothesis: the occurrences of these two words are independent.
$H_0: P(\text{new companies}) = P(\text{new}) \times P(\text{companies}) \approx 3.615 \times 10^{-7}$

Applying the t test to the bigrams
The null hypothesis cannot be rejected, since the t value is below the critical value: “new companies” is NOT a collocation.
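The bigram test can be sketched as follows. Each bigram position is treated as a Bernoulli trial that succeeds when the bigram is “new companies”, so the sample variance is $p(1-p)$, which is approximately $p$ for small $p$. The unigram counts and the corpus size come from the slides; the bigram count of 8 is an assumption made for this sketch, because the transcript does not state it, and `bigram_t` is a hypothetical helper name.

```python
import math

def bigram_t(c12, c1, c2, n):
    """t statistic for a bigram under the independence null hypothesis.

    c12: bigram count, c1 and c2: unigram counts, n: number of tokens.
    """
    mu0 = (c1 / n) * (c2 / n)   # expected probability if the words are independent
    x_bar = c12 / n             # observed bigram probability (sample mean)
    s2 = x_bar * (1 - x_bar)    # Bernoulli variance, close to x_bar for small x_bar
    return (x_bar - mu0) / math.sqrt(s2 / n)

N = 14_307_668          # tokens in the corpus (from the slides)
C_NEW = 15_828          # count of "new" (from the slides)
C_COMPANIES = 4_675     # count of "companies" (from the slides)
C_BIGRAM = 8            # ASSUMPTION: count of "new companies", not in the transcript

t_new_companies = bigram_t(C_BIGRAM, C_NEW, C_COMPANIES, N)
print(round(t_new_companies, 4))
```

Under these assumptions t comes out at roughly 1, well below the 2.576 critical value for the 0.005 level, which matches the slide's conclusion that independence cannot be rejected.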

Finding Collocations using t test
Note that a stop-word list was used to eliminate most candidate collocations.

Applying t tests to detect differences in semantics

Pearson’s chi-square test
- The chi-square test does not assume a normal distribution.
- It compares observed values with expected values and determines whether the observed values differ significantly from the expected ones.
- It is used in a variety of interesting ways: in machine translation; in determining the similarity between text sources
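For a 2×2 table of bigram counts, the chi-square statistic has the closed form $\chi^2 = N (O_{11} O_{22} - O_{12} O_{21})^2 / \big((O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})\big)$. The sketch below applies it to “new companies”, reusing the unigram counts from the earlier slides; the bigram count of 8 is again an assumption, and `chi_square_2x2` is a hypothetical helper name.

```python
def chi_square_2x2(c12, c1, c2, n):
    """Pearson's chi-square for the 2x2 contingency table of a bigram (w1, w2)."""
    o11 = c12                 # w1 followed by w2
    o12 = c1 - c12            # w1 followed by some other word
    o21 = c2 - c12            # some other word followed by w2
    o22 = n - c1 - c2 + c12   # neither w1 first nor w2 second
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Unigram counts and corpus size are from the slides; the bigram count 8 is an ASSUMPTION.
chi2_new_companies = chi_square_2x2(c12=8, c1=15_828, c2=4_675, n=14_307_668)
print(round(chi2_new_companies, 2))
```

With one degree of freedom, the critical value at the 0.05 level is 3.841, so under these assumed counts the chi-square test also fails to reject independence for “new companies”.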

Likelihood ratios

Calculating the likelihood ratio

Relative Frequency Ratios

Mutual Information
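Pointwise mutual information compares the observed bigram probability with the probability expected under independence: $I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)P(w_2)}$. As a small sketch, the function below computes it from raw counts. The unigram counts are the ones used in the t test slides; the bigram count of 8 is an assumption, and the name `pmi` is hypothetical.

```python
import math

def pmi(c12, c1, c2, n):
    """Pointwise mutual information (in bits) of a bigram, from raw counts."""
    p12 = c12 / n   # observed bigram probability
    p1 = c1 / n     # probability of the first word
    p2 = c2 / n     # probability of the second word
    return math.log2(p12 / (p1 * p2))

# Counts for "new" and "companies" are from the slides; the bigram count 8 is an ASSUMPTION.
pmi_new_companies = pmi(c12=8, c1=15_828, c2=4_675, n=14_307_668)
print(round(pmi_new_companies, 2))
```

A value near zero means the pair occurs about as often as chance predicts; larger positive values indicate a stronger association.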

Notion of Collocation
Non-compositionality:
- The meaning is not the sum of the parts: kick the bucket; white wine
Non-substitutability:
- Cannot say “yellow wine” instead of “white wine”
Non-modifiability:
- “poor as a church mouse” cannot be modified to “people as poor as church mice”
- “to get a frog in one’s throat” cannot be modified to “to get an ugly frog in one’s throat”