Statistics MP Oakes (1998) Statistics for corpus linguistics. Edinburgh University Press.

2 Rock bottom basics
Central tendency
–Applies to any set of numerical scores (e.g. frequency counts of word types, lengths of sentences in a corpus)
–mode (the most frequently obtained score)
  Easily affected by chance scores
–median (the middle score when the scores are ranked in order)
  Will be close to the mean if the data are symmetrically distributed
–mean (average), usually written $\bar{x}$ in equations
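A minimal sketch of the three measures, using Python's standard `statistics` module. The sentence lengths are made-up illustrative values, not data from the slides.

```python
from statistics import mean, median, multimode

# Hypothetical sentence lengths (in words); illustrative data only.
sentence_lengths = [4, 7, 7, 9, 12, 15, 30]

modes = multimode(sentence_lengths)   # most frequently obtained score(s)
middle = median(sentence_lengths)     # middle score once the data are ranked
average = mean(sentence_lengths)      # sum of the scores divided by their number

print("mode(s):", modes)
print("median:", middle)
print("mean:", average)
```

In this toy sample the single long sentence (30 words) pulls the mean above the median, which is the kind of skew that makes the two measures diverge.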

3 Rock bottom basics
Probability of an event a, usually written P(a)
–For a set of mutually exclusive alternative events that covers every outcome, the probabilities sum to 1
–Events are assumed to be independent
  This can be counter-intuitive, but in a coin toss the chance of heads is always 1/2, whatever the preceding tosses were
Probability of an event a, given some other condition b, is written P(a|b)
–Notice that P(a|b) does not depend on how likely b itself is, e.g. P(skelter|helter) is high even though helter is rare
Not to be confused with the probability of two events co-occurring
–written P(a,b)
–which is not, in general, the same as the combined probability P(a) × P(b); the two are equal only when a and b are independent
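The difference between P(a|b), P(a,b) and P(a) × P(b) can be sketched with simple counts. The toy token list below is invented for illustration, and estimating the probabilities as relative frequencies of unigrams and adjacent bigrams is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy corpus; the counts are illustrative only.
tokens = "helter skelter and helter skelter again the end".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
n_tokens = len(tokens)
n_bigrams = len(tokens) - 1

p_helter = unigrams["helter"] / n_tokens
p_skelter = unigrams["skelter"] / n_tokens

# Joint probability: how often the pair "helter skelter" occurs among all pairs.
p_joint = bigrams[("helter", "skelter")] / n_bigrams

# Conditional probability: how often "skelter" follows, given that we saw "helter".
p_cond = bigrams[("helter", "skelter")] / unigrams["helter"]

print("P(skelter|helter)      =", p_cond)
print("P(helter, skelter)     =", p_joint)
print("P(helter) x P(skelter) =", p_helter * p_skelter)
```

The three printed values differ: the conditional probability is the largest, and the joint probability is well above the product of the individual probabilities, which is what one would expect for words that are not independent.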

4 Simple word counts
A simple frequency count on its own might not tell you anything
Need to compare it with something else
–Frequency counts of other similar things
–Or the frequency count that you might expect on average
Then need to see if the measured difference is significant

5 Statistical significance
Probably the most commonly used statistical test in all of social science is the t-test
It is understood that any result could be due to random chance
Statistical significance tells you how likely it is that a result at least as extreme as yours would arise by random chance alone
Usually involves looking something up in a table, using
–Significance level (the level of certainty required)
–Number of variables or degrees of freedom

6 Correlation
Frequency counts might provide an ordered list
You might want to compare counts of two things to see if they are correlated, e.g. word length in English and number of characters in Chinese (Xu 1996)
Pearson's product-moment correlation coefficient (r)
There's also a formula for rank correlation

7 Xu (1996)
Table columns: X, Y, X², Y², XY, with a TOTAL row and N (the cell values are not preserved in this transcript)
Critical value for 15 pairs of observations at the 5% significance level is 0.441, so the result is not statistically significant (it is at the 10% level though)
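The column layout above (X, Y, X², Y², XY and their totals) matches the raw-sum form of Pearson's correlation coefficient. The sketch below assumes that standard formula; the function name `pearson_r` and the data are hypothetical, not Xu's figures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the column totals of X, Y, X^2, Y^2 and XY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired observations (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r = pearson_r(xs, ys)
print("r =", r)
```

The resulting r is then compared with the tabulated critical value for the number of pairs and the chosen significance level, as the slide does with 0.441 for 15 pairs.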

8 Comparison with expected values
We might want to compare relative frequencies of a range of features
Chi-square test shows if frequency differences are significant
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is observed value, E is expected value

9 Yamamoto (1996)
Frequencies of types of 3rd-person reference in English and Japanese
Table columns: Japanese, English, TOTALS, E(J), E(E), X²(J), X²(E)
Table rows: Ellipsis, Central pronouns, Non-central pronouns, Names, Common NPs, TOTAL
Sum = 258.8, significant for (5-1)×(2-1) = 4 degrees of freedom at the 0.1% level
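A sketch of how a table of this shape is usually processed: each expected value is taken as row total × column total / grand total, each cell contributes (O − E)² / E, and the contributions are summed. The function name `chi_square` and the 2×2 counts are hypothetical; they are not Yamamoto's data.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if rows and columns were unrelated.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical counts: rows are feature types, columns are two languages.
table = [[10, 20],
         [20, 10]]
stat, df = chi_square(table)
print("chi-square =", round(stat, 3), "with", df, "degree(s) of freedom")
```

The statistic is then looked up in a chi-square table at the given degrees of freedom to decide whether the difference is significant.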

10 Co-occurrence
Is the distribution of two things correlated?
Contingency table
–e.g. sentences where two words co-occur or not
Phi coefficient
Dice's coefficient
Several variants

          W1    not W1
W2        a     b
not W2    c     d
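The formulas for the two coefficients are not preserved in the transcript. The sketch below assumes their common textbook forms, expressed in the cells a, b, c, d of the contingency table: phi = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)) and Dice = 2a / ((a+b) + (a+c)). The function names and the counts are hypothetical.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 contingency table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def dice_coefficient(a, b, c):
    """Dice's coefficient: twice the co-occurrences over the two words' totals."""
    return 2 * a / ((a + b) + (a + c))

# Hypothetical sentence counts:
# a = both words present, b = only W2, c = only W1, d = neither.
a, b, c, d = 8, 2, 2, 88
phi = phi_coefficient(a, b, c, d)
dice = dice_coefficient(a, b, c)
print("phi  =", round(phi, 4))
print("Dice =", round(dice, 4))
```

Note that Dice's coefficient ignores cell d (sentences containing neither word), whereas phi uses all four cells.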

11 Co-occurrence
Scores such as Dice's coefficient need to be turned into something like a t score, so that significance can be measured

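12 Co-occurrence
Mutual information (MI)
Measures the relatedness of two variables
Compares the joint probability with the combined probability (the product of the individual probabilities)
$$MI(W_1, W_2) = \log_2 \frac{P(W_1, W_2)}{P(W_1)\,P(W_2)}$$
–MI ≈ 0: chance association
–MI >> 0: strong association
–MI << 0: complementary distribution
In terms of the contingency matrix, with N = a + b + c + d:
$$MI = \log_2 \frac{aN}{(a+b)(a+c)}$$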
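A minimal sketch of mutual information computed from the cells of a 2×2 contingency table, assuming the usual pointwise definition log2 of P(W1,W2) / (P(W1) × P(W2)) with probabilities estimated as relative frequencies. The function name and the counts are hypothetical.

```python
from math import log2

def mutual_information(a, b, c, d):
    """Pointwise MI from a 2x2 table (a = both words, b = only W2, c = only W1, d = neither)."""
    n = a + b + c + d
    p_joint = a / n          # P(W1, W2)
    p_w1 = (a + c) / n       # P(W1): column total for W1
    p_w2 = (a + b) / n       # P(W2): row total for W2
    return log2(p_joint / (p_w1 * p_w2))

# Hypothetical counts: the two words co-occur far more often than chance predicts.
mi_strong = mutual_information(8, 2, 2, 88)

# Hypothetical counts chosen so that the joint probability equals the product.
mi_chance = mutual_information(1, 9, 9, 81)

print("strong association:", mi_strong)
print("chance association:", mi_chance)
```

The second table gives a score of zero because a/N equals (a+c)/N × (a+b)/N exactly, which is the chance-association case listed on the slide.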

13 Church & Hanks (1990)
Used MI to show word associations
–E.g. doctors + {dentists, nurses, treating, treat, examine, bills, hospitals}
–In contrast with doctors + {with, a, is}
–Identify phrasal verbs, e.g. set + {up, off, out, in} but not about
–Using a parser to separate N and V readings, find the most likely objects of the verb drink
–What you can do to a telephone (sit by, disconnect, answer, …)

15 Church et al. (1991)
strong vs powerful experiment

MI      word pair           MI      word pair
10.47   strong northerly    8.66    powerful legacy
9.76    strong showings     8.58    powerful tool
9.3     strong believer     8.35    powerful storms