Overview of Corpus Linguistics

Presentation transcript:

Overview of Corpus Linguistics Ling 240 10/8/14

Outline Definition History Current status

What is corpus linguistics? Linguistics: the scientific study of language using… Corpus: a large and principled collection of natural texts

History of corpus linguistics As early as 1897, Wilhelm Kaeding had 5,000 people compile a corpus of 11 million German words (and calculate word frequencies and the distribution of letters). In the early 1900s, Otto Jespersen, a Danish professor, filled shoeboxes with thousands of paper slips containing interesting English sentences. In 1959, Randolph Quirk started the Survey of English Usage, a collection of spoken and written language which he used to create a comprehensive English grammar.

History of corpus linguistics 1961: Brown Corpus (1M words; 500 samples of 2,000 words; various genres; printed, edited, American English). 1961: Lancaster-Oslo/Bergen (LOB) Corpus (British version of Brown Corpus). 1991: Frown and FLOB Corpora. 1988: International Corpus of English (ICE) (World English varieties; 20 completed so far).

History of corpus linguistics 1991: British National Corpus (BNC) (100M words; wide range of written (90%) and spoken (10%) texts). 2008: BYU Corpora: Corpus of Contemporary American English (COCA); TIME corpus; Corpus of Historical American English (COHA); GloWbE Corpus. International Corpus of Learner English (ICLE). MICASE & MICUSP.

Status of corpus linguistics Is corpus linguistics a branch of linguistics or a method for doing linguistics? Evidence for branch: Journals such as Corpora and the International Journal of Corpus Linguistics Some researchers claim corpus linguistics as their area of emphasis Evidence for method: Most linguistic phenomena can be measured using CL CL has the potential to inform virtually any theory

Characteristics of corpus-based analyses It is empirical, analyzing the actual patterns of use in natural texts; It utilizes a large and principled collection of natural texts, known as a “corpus” as the basis for analysis; It makes extensive use of computers for analysis, using both automatic and interactive techniques; It depends on both quantitative and qualitative analytical techniques

Uses of corpora Changes over time Changes in register Changes in situation Changes in individual

Time

Different Genitives Of-genitive: the leg of the table. 's genitive: the table's leg. NN genitive: the table leg.

‘s genitive vs. of-genitive vs. NN sequence

NN sequence across time in three registers

Situation

Phrasal Compression Uncompressed: The dog that was hungry was looking for something to eat. Drugs that require a prescription should be monitored. Compressed: The hungry dog was looking for something to eat. Prescription drugs should be monitored.

Phrasal compression across levels in an EAP reading series

Phrasal compression across levels in another EAP reading series

Individual

‘Abstract Exposition versus Concrete Action’

Corpus Design and Representativeness Ling 240

Designing Representative Corpora Many people believe that the design of a corpus doesn’t matter as long as it is large enough. Researchers typically focus on target domain representativeness (medical texts, newspapers, academic, general English, spoken) and ignore linguistic representativeness. Very few corpora are actually evaluated in terms of their representativeness (target domain or linguistic).

Steps: representing the target domain 1. Describe the target domain. 2. Design the corpus to represent the target domain. 3. Complete the sampling. Simple random: randomly choose sections of the data for the corpus. Stratified: determine what genres are included and randomly sample from those data. Cluster: divide data into naturally occurring groups and sample from them.
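The three sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text inventory, the genre names, and the "issue" field used as the naturally occurring group are all hypothetical, and the cluster function takes every text from each chosen group (a one-stage variant; one could also sample again within each group).

```python
import random
from collections import defaultdict

def group_by(texts, key):
    """Group text records by one metadata field."""
    groups = defaultdict(list)
    for text in texts:
        groups[text[key]].append(text)
    return groups

def simple_random_sample(texts, n, rng):
    """Simple random: every text has the same chance of selection."""
    return rng.sample(texts, n)

def stratified_sample(texts, per_genre, rng):
    """Stratified: fix the genres first, then sample randomly within each."""
    strata = group_by(texts, "genre")
    return [t for genre in sorted(strata)
            for t in rng.sample(strata[genre], per_genre)]

def cluster_sample(texts, n_clusters, rng):
    """Cluster: pick whole naturally occurring groups (here: issues) at random."""
    clusters = group_by(texts, "issue")
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [t for issue in chosen for t in clusters[issue]]

# Hypothetical target domain: 3 genres x 4 issues x 5 texts = 60 texts.
texts = [{"id": f"{genre}-{issue}-{i}", "genre": genre, "issue": issue}
         for genre in ("news", "fiction", "academic")
         for issue in ("issue1", "issue2", "issue3", "issue4")
         for i in range(5)]

rng = random.Random(240)  # fixed seed so the sample is reproducible
simple = simple_random_sample(texts, 12, rng)
strat = stratified_sample(texts, 4, rng)
clust = cluster_sample(texts, 2, rng)
print(len(simple), len(strat), len(clust))
```

Only the stratified sample guarantees the same number of texts from every genre; the simple random sample may over- or under-represent a genre by chance.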

Norming practice
Formula: (raw count / total words) * 1000

            Text A    Text B
# Nouns     50        100
# Words     200       500

Text A: (50 nouns / 200 words) * 1000 = 250 nouns per thousand words
Text B: (100 nouns / 500 words) * 1000 = 200 nouns per thousand words
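The norming formula is easy to wrap in a small helper. The sketch below reproduces the two worked calculations (50 nouns in 200 words, 100 nouns in 500 words); the function name `normed_rate` and its `per` parameter are illustrative choices, not part of any particular corpus tool.

```python
def normed_rate(raw_count, total_words, per=1000):
    """Return the rate of occurrence per `per` words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count * per / total_words

text_a = normed_rate(50, 200)    # 250.0 nouns per thousand words
text_b = normed_rate(100, 500)   # 200.0 nouns per thousand words
print(text_a, text_b)
```

Text B has twice as many nouns as Text A in raw terms, but the lower normed rate. Passing `per=1_000_000` gives rates per million words, the usual basis for comparing corpora of very different sizes.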

Norming practice
BNC has 100 million words
COCA has 450 million words

            BNC               COCA
            #      Per M      #      Per M
snuck       11                767
sneaked     132               830

Corpus Annotation Ling 240

Annotation Corpora can be annotated for a wide range of external and internal variables. External variables Speaker L1 background Gender Extralinguistic information (e.g., laughter, nodding, etc.)

External annotation: example
<Exam ID: 3B>
<Arrangement ID: 54945>
<Center ID: 14>
<Candidate ID: 42285>
<Test Date: 12/6/2013>
<Age: 19>
<Gender: F>
<L1: Arabic>
<Reason for test: B>
<Original MELAB: 2>
<Original Transformed: 3>
<Second MELAB: >
<Second Transformed: >
<End header>
E: Alright, welcome to the MELAB speaking exam, my name is <deleted>. And uh what is your name?
T: Uh my name is uh <deleted>.
E: Now I'll just uh read the MELAB ID number that we have for you. Uh you don't need to know it or anything. The number is <deleted>. Alright now that's out of the way. Uh why don't you uh start by telling me a little bit about why you're taking the MELAB today.
T: Uh actually I came to USA to complete my education here, so uh if I want to go to university I need to get score and to to be good in speak English and I take uh a lab exam so I can enter to the university.
E: Okay uh so uh what are you interested in studying at the university?
T: Actually I think about uh medical science.

Part of speech tagging Rule-based Probability-based 95%+ accuracy rate Some features very easy (e.g., the) Some features more difficult (e.g., that) Pronoun (He doesn’t like that.) Determiner (He doesn’t like that dog.) Complementizer (They thought that they could do it.) Relativizer (The thought that I entertained.)

POS tagging accuracy Precision: What percent of the cases labeled as X are actually X? Recall: What percent of all of the true cases of X were labeled as X? Example: He saw that dog that I saw. If both ‘that’s are tagged as determiners: Calculate the precision and recall for determiners. Calculate the precision and recall for relativizers.
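One way to check the exercise is to compute precision and recall directly from the gold and predicted tags. The sketch below looks only at the two tokens of "that" in the example sentence; the tag labels `DET` and `REL` are shorthand invented here, and returning `None` when nothing was labeled as X is one possible convention for the undefined 0/0 case.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one tag; None where the ratio is undefined."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    labeled = sum(1 for p in predicted if p == label)  # cases labeled as X
    actual = sum(1 for g in gold if g == label)        # true cases of X
    precision = true_pos / labeled if labeled else None
    recall = true_pos / actual if actual else None
    return precision, recall

# "He saw that dog that I saw."
gold = ["DET", "REL"]        # first 'that' is a determiner, second a relativizer
predicted = ["DET", "DET"]   # the tagger called both determiners

det = precision_recall(gold, predicted, "DET")
rel = precision_recall(gold, predicted, "REL")
print(det)  # (0.5, 1.0)
print(rel)  # (None, 0.0)
```

For determiners, only half of the labeled cases are correct, but the one true determiner was found. For relativizers, nothing was labeled, so precision is undefined and recall is zero.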

POS tagging: two examples

CLAWS Tagger:
Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 were_VBDR warned_VVN to_TO leave_VVI immediately_RR

Biber Tagger:
Everything ^pn++++=Everything
I ^pp1a+pp1+++=I've
've ^vb+hv+aux++0=EXTRAWORD
read ^vprf+++xvbnx+=read
says ^vb+vpub+++=say's
they ^pp3a+pp3+++=they
were ^vbd+bed+aux++=were
warned ^vpsv++agls+xvbnx+=warned
to ^to+vcmp+++=to
leave ^vbi++++=leave
immediately ^rb+tm+++=immediately

Lemmatization Lemma: the citation or dictionary entry. Run is the lemma; it includes the words run, running, runs, ran. We often want the frequency of the lemma, not of a particular word like running.
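The difference between word-form frequency and lemma frequency can be shown with a toy example. The lookup table below is hand-written for the single lemma run and the sample sentence is invented; a real lemmatizer would use a full lexicon and part-of-speech information, so treat this purely as a sketch.

```python
from collections import Counter

# Hypothetical lookup covering only the forms of the lemma "run".
LEMMA_LOOKUP = {"run": "run", "runs": "run", "running": "run", "ran": "run"}

def lemma_frequencies(tokens, lookup):
    """Count lemmas; forms missing from the lookup count as their own lemma."""
    return Counter(lookup.get(tok.lower(), tok.lower()) for tok in tokens)

tokens = "She ran yesterday and runs today ; running is fun so they run".split()

word_freq = Counter(tok.lower() for tok in tokens)
lemma_freq = lemma_frequencies(tokens, LEMMA_LOOKUP)

print(word_freq["running"])  # 1
print(lemma_freq["run"])     # 4
```

The word form running occurs once, while the lemma run occurs four times once ran, runs, running, and run are grouped together.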

Answer these questions about COCA What external annotation does it contain? What internal annotation does it contain?

Answer these questions about COCA What external annotation does it contain? Text source; date of publication. What internal annotation does it contain? Lemmatization; part of speech; genre.

Example: ‘s-’ versus ‘of-genitive’ ‘the bird’s owner’ vs. ‘the owner of the bird’ Finding 1: “by 1991, the s-genitive had overtaken the of-genitive in frequency” (Leech, et al., 2009) Finding 2: of-genitive is almost 10 times more frequent than the s-genitive in present-day English (Longman Grammar) Q: Are these findings contradictory???

Corpus study design: variationist Two approaches to corpus linguistics: Variationist and Text-Linguistic (Biber, 2012) Variationist: “has the goal of comparing linguistic variants: whether one or the other variant is preferred,” and identifying factors that predict which variant is used (Biber, 2012). Statistics: Binomial/logistic regression; Linear discriminant analysis Interpretation: When a choice can be made, variant X is preferred over variant Y, and factors A, B, and C play a role.

Variationist Analysis (Type A) Unit of analysis is the linguistic feature. Most studies do not take register into account (e.g., collocational studies). Comparison of the proportion of use in a particular register. E.g., Szmrecsanyi & Hinrichs (2008): preference for the s-genitive over the of-genitive in speech; BUT: s-genitives are overall more frequent in writing.

Corpus study design: text-linguistic Text-Linguistic: “has the goal of providing a linguistic description of texts, by describing the density of grammatical features in texts” (Biber, 2012) Statistics: T-test, ANOVA, Multiple regression, Factor analysis Interpretation: Feature X is more frequent in context A than context B; or Feature X is more frequent than feature Y

Text-linguistic (Type B) Comparison of actual frequency of use in a particular register Unit of analysis is text Normed rates of occurrence by text Much more common for register studies

Text-linguistic (Type C) Also compares frequencies of use in a particular register Unit of analysis is subcorpus Normed rates of occurrence for features across subcorpora Cannot use inferential statistics (need to look at individual text to get a mean score for the register)
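The practical difference between Type B and Type C designs shows up as soon as the rates are computed. In the sketch below the three (feature count, word count) pairs are made-up numbers for one register: Type B norms each text separately and so yields a mean and a spread, while Type C pools everything into one subcorpus and yields a single rate.

```python
from statistics import mean, stdev

def rate_per_thousand(count, words):
    return count * 1000 / words

# Hypothetical register: (feature count, word count) for each text.
texts = [(30, 1000), (60, 1000), (60, 2000)]

# Type B: the text is the unit of analysis, so each text gets its own rate.
per_text = [rate_per_thousand(c, w) for c, w in texts]
type_b_mean = mean(per_text)   # 40.0
type_b_sd = stdev(per_text)    # spread across texts, usable in a t-test or ANOVA

# Type C: the subcorpus is the unit of analysis, so counts are pooled first.
total_count = sum(c for c, _ in texts)
total_words = sum(w for _, w in texts)
type_c_rate = rate_per_thousand(total_count, total_words)  # 37.5

print(per_text, type_b_mean, type_c_rate)
```

The pooled rate weights long texts more heavily, so it need not equal the mean of the per-text rates, and because it is a single number with no variance it cannot feed an inferential test.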

Quantitative analysis Coding/tagging features Counts in text vs. subcorpus Norming (raw count/total words * 1000) Use appropriate statistical tests if applicable

Kinds of Corpora Spoken language General corpora (mainly written) Bitext (two languages side-by-side) Specialized Children’s speech L2 learner speech Historical

General Corpora Mainly written British National Corpus (BNC) 100 million words 10% spoken 25% fiction 75% non-fiction

General Corpora Corpus of Contemporary American English (COCA) 450 million words (more added every year) Divided into registers Spoken Fiction Academic

General Corpora International Corpus of English 1 million words from each country 60% spoken, 40% written

Historical Corpora Helsinki Corpus: English texts from c. 730 to 1710. Corpus of Historical American English (COHA): 1810s-2000s.

Introduction to COHA Corpus of Historical American English End up verbing Try and verb versus try to verb Adjectives and nouns used in 2000s not before Collocates of Muslim, liberal, Mormon ?

Raw Corpora Not easily searchable Not tagged Project Gutenberg Pre 1928 books (copyright expired) Online newspapers Time Magazine The internet General Conference

Where can you get corpora? Online: BNC, COCA, COHA. Distributors (membership required): ELRA (based in Europe); Linguistic Data Consortium (US based; BYU has a membership; catalog; top 10 corpora).