Semantics and discourse: Collocations, keywords and reliability of manual coding Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide.


Semantics and discourse: Collocations, keywords and reliability of manual coding Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.

Statistical analysis of discourse involves an inherent paradox. While discourse is often fluid, ambiguous and fuzzy, statistics expects rigour, precision and clearly defined categories.

Think about and discuss: What associations come to mind when you see the word ‘love’? Why do you think the word has these associations for you?

Collocations
[Figure labels: node, collocates, collocation window (span): 1L 1R, i.e. one word to the left and one word to the right of the node.]
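The idea of a collocation window can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used by any particular corpus tool: the function name `collocates_in_window` and the toy sentence are invented here, and tokenisation is reduced to splitting on whitespace.

```python
from collections import Counter

def collocates_in_window(tokens, node, left=1, right=1):
    """Count the words that occur within `left` words to the left and
    `right` words to the right of each occurrence of `node`.
    The defaults correspond to a 1L 1R span."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:  # skip the node itself
                counts[tokens[j]] += 1
    return counts

# Invented toy text (an assumption for illustration, not the poem from the slides).
tokens = "my love is like a red rose and my love is true".split()
freqs = collocates_in_window(tokens, "love")

# Rank-ordered list of co-occurring words by observed frequency.
for word, freq in freqs.most_common():
    print(word, freq)
```

Sorting the counts in this way gives a plain rank-ordered list of co-occurring words by observed frequency, with no baseline applied.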

Collocations (cont.)
Is ‘my’ really a genuine collocate of ‘love’ in the poem? In other words, is ‘my’ really strongly associated with ‘love’? The observed frequency (3) can be evaluated against one of three baselines:
No baseline: We compare the observed frequencies of all individual words co-occurring with the node and produce a rank-ordered list.
Random co-occurrence baseline (‘shake the box’ model): We compare the observed frequencies with the frequencies expected by chance alone and evaluate the strength of collocation using a mathematical equation which puts emphasis on a particular aspect of the collocational relationship.
Word competition baseline: We use a different type of baseline from random co-occurrence; this baseline is incorporated in the equation, which again highlights a particular aspect of the collocational relationship.

‘Shake the box’ model
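One way to picture the random co-occurrence baseline is to shuffle the words of a text many times and count how often the node and a candidate collocate end up next to each other by chance alone. The sketch below does this by simulation. It is an illustration under stated assumptions: the function names, the number of shuffles and the toy text are all invented here, and in practice expected frequencies are normally calculated from a formula rather than simulated.

```python
import random

def cooccurrences(tokens, node, collocate, left=1, right=1):
    """Count occurrences of `collocate` inside the window around each `node`."""
    count = 0
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            count += window.count(collocate)
    return count

def expected_by_shuffling(tokens, node, collocate, n_shuffles=1000,
                          seed=0, left=1, right=1):
    """Estimate the co-occurrence frequency expected by chance by
    'shaking the box': shuffle a copy of the text repeatedly and
    average the co-occurrence counts."""
    rng = random.Random(seed)   # fixed seed so the estimate is repeatable
    shuffled = list(tokens)     # work on a copy; the input is left untouched
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += cooccurrences(shuffled, node, collocate, left, right)
    return total / n_shuffles

# Invented toy text, for illustration only.
tokens = "my love is like a red rose and my love is true".split()
observed = cooccurrences(tokens, "love", "my")
expected = expected_by_shuffling(tokens, "love", "my")
print(observed, expected)
```

If the observed frequency is clearly higher than the simulated expectation, the pair co-occurs more often than chance would predict. How much weight to give that difference is what the individual association measures decide.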

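Association measures
Ten measures and their equations (O = observed frequency, E = expected frequency, R = row total and C = column total in the contingency table):
MI2: $\log_2 \frac{O_{11}^{2}}{E_{11}}$
MI3: $\log_2 \frac{O_{11}^{3}}{E_{11}}$
z score: $\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}$
T score: $\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$
Log likelihood: $2 \times \left( O_{11} \times \log \frac{O_{11}}{E_{11}} + O_{12} \times \log \frac{O_{12}}{E_{12}} + \dots \right)$, with the sum continuing over the remaining cells of the table
Dice: $\frac{2 \times O_{11}}{R_{1} + C_{1}}$
Log Dice: $14 + \log_2 \frac{2 \times O_{11}}{R_{1\mathrm{cor}} + C_{1}}$
Log ratio: $\log_2 \frac{O_{11} \times R_{2}}{O_{21} \times R_{1\mathrm{cor}}}$
Delta P: $\frac{O_{11}}{R_{1}} - \frac{O_{21}}{R_{2}}$
Cohen’s d: $\frac{M_{\text{in window}} - M_{\text{outside window}}}{\text{pooled SD}}$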
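Several of these measures can be computed directly from a handful of contingency-table values. The sketch below uses the commonly cited definitions (for example, T score as (O11 - E11) / sqrt(O11) and log Dice as 14 + log2 of the Dice coefficient). The function names, the uncorrected expected-frequency helper and the example counts are assumptions made for illustration; any window-size correction of R1 is left to the caller.

```python
import math

def expected_frequency(r1, c1, n):
    """Uncorrected expected frequency E11 = R1 * C1 / N (an assumption:
    no window-size correction is applied here)."""
    return r1 * c1 / n

def mi2(o11, e11):
    return math.log2(o11 ** 2 / e11)

def mi3(o11, e11):
    return math.log2(o11 ** 3 / e11)

def z_score(o11, e11):
    return (o11 - e11) / math.sqrt(e11)

def t_score(o11, e11):
    return (o11 - e11) / math.sqrt(o11)

def dice(o11, r1, c1):
    return 2 * o11 / (r1 + c1)

def log_dice(o11, r1, c1):
    return 14 + math.log2(dice(o11, r1, c1))

def delta_p(o11, r1, o21, r2):
    return o11 / r1 - o21 / r2

# Hypothetical counts: node frequency R1 = 16, collocate frequency C1 = 16,
# N = 128 tokens, and 8 co-occurrences inside the window.
o11, r1, c1, n = 8, 16, 16, 128
o21, r2 = c1 - o11, n - r1
e11 = expected_frequency(r1, c1, n)

print("E11     ", e11)
print("MI2     ", mi2(o11, e11))
print("MI3     ", mi3(o11, e11))
print("z score ", round(z_score(o11, e11), 3))
print("T score ", round(t_score(o11, e11), 3))
print("log Dice", log_dice(o11, r1, c1))
print("Delta P ", round(delta_p(o11, r1, o21, r2), 3))
```

Running the measures side by side on the same counts makes it easier to see that they put the observed and expected frequencies on very different scales, which is why their rankings of collocates can differ.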

Association measures (cont.)

Collocation networks
[Figure labels: an initial node, further nodes N2, N3 and N4, and collocates C1, C2, C3 and C5.]

Collocation parameters notation (CPN) (Brezina et al. 2015)

Keywords
[Figure labels: corpus of interest, reference corpus; 1M, 100M; positive keywords (+), lockwords, negative keywords (-).]

Corpus of interest C    Reference corpus R    Decision
frequent                infrequent            + (positive keyword)
infrequent              frequent              - (negative keyword)
comparable freq.        comparable freq.      0 (lockword)
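The decision logic behind positive keywords, negative keywords and lockwords can be sketched as a comparison of relative frequencies. This is a simplified illustration: the function names, the cut-off of 2.0 for what counts as a clear difference and the example counts are all assumptions, and the slides do not prescribe a threshold.

```python
def relative_frequency(count, corpus_size, per=1_000_000):
    """Frequency normalised to a common base (per million tokens)."""
    return count / corpus_size * per

def keyword_decision(freq_c, size_c, freq_r, size_r, ratio_threshold=2.0):
    """Return '+' (positive keyword), '-' (negative keyword) or '0' (lockword).
    `ratio_threshold` is a hypothetical cut-off chosen for this sketch."""
    rel_c = relative_frequency(freq_c, size_c)
    rel_r = relative_frequency(freq_r, size_r)
    if rel_c == rel_r:
        return "0"
    if rel_c >= ratio_threshold * rel_r:
        return "+"
    if rel_r >= ratio_threshold * rel_c:
        return "-"
    return "0"

# Hypothetical counts in a 1M-token corpus of interest and a 100M-token
# reference corpus.
print(keyword_decision(50, 1_000_000, 1_000, 100_000_000))  # more frequent in C
print(keyword_decision(10, 1_000_000, 5_000, 100_000_000))  # more frequent in R
print(keyword_decision(10, 1_000_000, 1_000, 100_000_000))  # comparable
```

Normalising to a common base matters because the two corpora usually differ greatly in size, so raw counts cannot be compared directly.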

Keywords (cont.)
Keywords identified using four measures: log likelihood, SMP (simple maths parameter, with 100 as the constant), log ratio and Cohen’s d.
U.S., LABOR, TOWARD, PERCENT, NEIGHBORHOOD, AMERICAN, PROGRAM, DEFENSE, RECOGNIZE, CONGRESSIONAL, NEIGHBORS, ATLANTA, COLORED, STATES, BUSH, PGF2A, MANHATTAN, FEDERAL, MACDOWELL, FAVORITE, MRNA, RECOGNIZED, PRESIDENT, CENTER, MR., ABBY, REALIZE, GENOME, RECOGNIZING, PROGRAMS, FLORIDA, TRAVELED, UNITED, 9-11, SIGNALED, STATE, WASHINGTON, DOE, COLOR, CONGRESS, POE, CALIFORNIA, AMERICANS, ROUSSEAU, GOTTEN, NS1, REZKO, FAVOR, AMERICA, MITCH, FINALLY, WAR, ADDITIVES, CENTERS
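Two of the keyword measures named above are simple enough to sketch directly. The code assumes the usual definitions: the simple maths parameter as the ratio of relative frequencies per million with a constant added to both, and log ratio as the binary logarithm of the ratio of relative frequencies. The function names and the counts are invented for illustration.

```python
import math

def per_million(count, corpus_size):
    """Relative frequency per million tokens."""
    return count / corpus_size * 1_000_000

def simple_maths(freq_c, size_c, freq_r, size_r, k=100):
    """Simple maths parameter: (rel. freq. in C + k) / (rel. freq. in R + k).
    The constant k also prevents division by zero."""
    return (per_million(freq_c, size_c) + k) / (per_million(freq_r, size_r) + k)

def log_ratio(freq_c, size_c, freq_r, size_r):
    """Binary log of the ratio of relative frequencies.
    Assumes both frequencies are greater than zero."""
    return math.log2(per_million(freq_c, size_c) / per_million(freq_r, size_r))

# Hypothetical word: 400 hits in a 1M-token corpus of interest,
# 10,000 hits in a 100M-token reference corpus.
print(simple_maths(400, 1_000_000, 10_000, 100_000_000))
print(log_ratio(400, 1_000_000, 10_000, 100_000_000))
```

A larger constant in the simple maths parameter shifts attention towards more frequent words, which is one reason different parameter settings yield different keyword lists.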

Inter-rater agreement
Inter-rater agreement, which is an estimate of how reliable and consistent a coding is, should be reported in studies working with a judgement variable. A judgement variable is a variable that involves categorisation or evaluation of cases (e.g. concordance lines) by the analyst, which might bring an element of subjectivity into the study.

Inter-rater agreement (cont.)
Positive or negative? Categorisation is a matter of choice. Rigorous?

Inter-rater agreement (cont.)
[Figure labels: negative, positive.]

Inter-rater agreement (cont.)
$\text{raw agreement} = \frac{\text{cases of agreement}}{\text{total no. of cases}} = \frac{8}{10} = 0.8$
[Table: the code (negative or positive) assigned to each case by Rater 1 and Rater 2, with a column marking agreement (YES or NO).]
$\text{agreement statistic} = \frac{\text{raw agreement} - \text{agreement by chance}}{1 - \text{agreement by chance}}$
$AC_1 = \frac{0.8 - 0.32}{1 - 0.32} = 0.71$
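The calculation can be reproduced in a short script. The sketch below assumes the standard two-rater definitions of chance agreement for Gwet's AC1 and Cohen's κ; the function names are invented, and the ten codes are hypothetical data constructed so that raw agreement is 0.8 and chance agreement for AC1 is 0.32, as in the worked figures. They are not the actual codes from the slide.

```python
from collections import Counter

def raw_agreement(rater1, rater2):
    """Proportion of cases on which the two raters assigned the same code."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

def chance_corrected(p_observed, p_chance):
    """(raw agreement - agreement by chance) / (1 - agreement by chance)."""
    return (p_observed - p_chance) / (1 - p_chance)

def gwet_ac1(rater1, rater2):
    """Gwet's AC1 for two raters; needs at least two categories in the data."""
    categories = set(rater1) | set(rater2)
    n = len(rater1)
    c1, c2 = Counter(rater1), Counter(rater2)
    # Average proportion of each category across both raters.
    pi = {k: (c1[k] + c2[k]) / (2 * n) for k in categories}
    p_chance = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    return chance_corrected(raw_agreement(rater1, rater2), p_chance)

def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters."""
    n = len(rater1)
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(rater1) | set(rater2))
    return chance_corrected(raw_agreement(rater1, rater2), p_chance)

# Hypothetical codes for ten cases (constructed for illustration).
rater1 = ["negative"] * 7 + ["positive", "positive", "negative"]
rater2 = ["negative"] * 7 + ["positive", "negative", "positive"]

print(raw_agreement(rater1, rater2))
print(round(gwet_ac1(rater1, rater2), 2))
print(round(cohen_kappa(rater1, rater2), 3))
```

On these invented codes the two statistics differ noticeably even though raw agreement is the same, because they estimate agreement by chance in different ways.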

Cohen’s κ/AC1: 0.71

Inter-rater agreement (cont.)

Type of judgement variable    No. of values    No. of raters    Statistic(s) to use
Nominal (categories)          2 and more       2                Gwet’s AC1 and Cohen’s κ
Nominal (categories)          2 and more       3 and more       Gwet’s AC1 and Fleiss' κ
Ordinal (ranks)                                                 Gwet’s AC2
Interval/Ratio (scale)                                          Interclass correlation (ICC)
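The choice of statistic can be written as a small lookup. This is only a convenience sketch of the table's logic; the function name and the string labels for the variable types are assumptions introduced here.

```python
def agreement_statistics(variable_type, n_raters):
    """Suggest inter-rater agreement statistic(s) for a judgement variable.
    `variable_type` is one of 'nominal', 'ordinal', 'interval' or 'ratio'."""
    if n_raters < 2:
        raise ValueError("Inter-rater agreement needs at least two raters.")
    kind = variable_type.lower()
    if kind == "nominal":
        if n_raters == 2:
            return ["Gwet's AC1", "Cohen's kappa"]
        return ["Gwet's AC1", "Fleiss' kappa"]
    if kind == "ordinal":
        return ["Gwet's AC2"]
    if kind in ("interval", "ratio"):
        return ["Interclass correlation (ICC)"]
    raise ValueError(f"Unknown variable type: {variable_type!r}")

print(agreement_statistics("nominal", 2))
print(agreement_statistics("nominal", 4))
print(agreement_statistics("ordinal", 2))
```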

Things to remember
There are many association measures, each highlighting different aspects of the collocational relationship (e.g. frequency or exclusivity). There is no one best association measure.
Collocations can be presented in tabular form (a table) or visual form (a graph).
Collocation networks show complex cross-associations in texts and discourses.
The keyword procedure is, in essence, a comparison which depends on a number of parameters. There is no such thing as one set of keywords.
For judgement variables, an inter-rater agreement statistic should be reported. Gwet’s AC1 and AC2, Cohen’s κ and Fleiss' κ, as well as Interclass correlation, can be used depending on the situation.