Corpus Linguistics Lecture 1 Albert Gatt

Contact details
- Drop me a line with queries etc., and to arrange meetings.

Course web page
- Course web page: g/corpusLing.html
- Details of tutorials, lectures etc. will always be on the web page.
- Readings for the lecture
- Downloadable lecture notes (available after the lecture)

Suggested text
- T. McEnery and A. Wilson (2001). Corpus Linguistics. Edinburgh University Press.
- NB: Over the course of these lectures, other readings will also be proposed and made available, usually online.

Lectures and assessment
- Structure of lectures:
  - All lectures will take place in the lab.
  - Usually, about half the lecture (1hr) will be devoted to practical work.
- Course assessment:
  - assignment
  - Final essay (ca words)
  - Essay topics will involve research on corpora!

Questions… ?

What is corpus linguistics?
- A new theory of language? No. In principle, any theory of language is compatible with corpus-based research.
- A separate branch of linguistics (in addition to syntax, semantics…)? No. Most aspects of language can be studied using a corpus (in principle).
- A methodology to study language in all its aspects? Yes!
  - The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
  - A corpus is an electronic, machine-readable collection of texts that represent “real life” language use.

Goals of this lecture
- To define the terms:
  - corpus linguistics
  - corpus
- To give an overview of the history of corpus linguistics
- To contrast the corpus-based approach with other methodologies used in the study of language

An initial example
- Suppose you’re a linguist interested in the syntax of verb phrases. Some verbs are transitive, some intransitive:
  - I ate the meat pie (transitive)
  - I swam (intransitive)
- What about:
  - quiver
  - quake
- Are these really intransitive? Most traditional grammars characterise them as intransitive.

One possible methodology…
- The standard method relies on the linguist’s intuition:
  - I never use quiver/quake with a direct object.
  - I am a native speaker of this language.
  - All native speakers have a common mental grammar or competence (Chomsky).
  - Therefore, my mental grammar is the same as everyone else’s.
  - Therefore, my intuition accurately reflects English speakers’ competence.
  - Therefore, quiver/quake are intransitive.
- NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.

Another possible methodology…
- This one relies on data:
  - I may never use quiver/quake with a direct object, but…
  - …other people might.
  - Therefore, I’ll get my hands on a large sample of written and/or spoken English and check.

Quiver/quake: the corpus linguist’s answer
- A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions:
  - the insect quivered its wings
  - it quaked his bowels (with fear)
- They used a corpus of 50 million words to find examples of the verbs.
- With sufficient data, you can find examples that your own intuition won’t give you…

Example II: lexical semantics
- Quasi-synonymous lexical items exhibit subtle differences in context:
  - strong
  - powerful
- A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
- Some differences between strong and powerful (source: British National Corpus):
  - strong: wind, feeling, accent, flavour
  - powerful: tool, weapon, punch, engine
- The differences are subtle, but examining their collocates helps.
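A minimal sketch of how collocates such as those for strong and powerful might be counted. The function name `collocates`, the one-word window, and the toy text are all assumptions made for illustration; the text is invented and is not drawn from the BNC.

```python
from collections import Counter


def collocates(tokens, node, window=1):
    """Count the words found within `window` positions to the right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for neighbour in tokens[i + 1 : i + 1 + window]:
                counts[neighbour] += 1
    return counts


# Toy, invented text (NOT from the BNC), already lower-cased and tokenised.
text = (
    "a strong wind blew and a strong accent was heard "
    "a powerful engine and a powerful weapon and a strong wind"
).split()

strong = collocates(text, "strong")
powerful = collocates(text, "powerful")

print("strong:", strong.most_common())
print("powerful:", powerful.most_common())
```

On a real corpus one would typically widen the window and compare the counts against each word's overall frequency, since raw co-occurrence counts favour words that are common everywhere.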

Some preliminary definitions
- The second approach is typical of the corpus-based methodology.
- Corpus: A large, machine-readable collection of texts.
  - Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
- Corpus-based methodology: An approach to Natural Language analysis that relies on generalisations made from data.

Example (British National Corpus)
- British National Corpus (BNC):
  - 100 million words of English
  - 90% written, 10% spoken
- Designed to be representative and balanced.
- Texts from different genres (literature, news, academic writing…)
- Annotated: Every single word is accompanied by part-of-speech information.

Example (continued)
- A sentence in the BNC: Explosives found on Hampstead Heath.
- In the corpus, each token is marked up separately: Explosives | found | on | Hampstead | Heath | .

Example (continued)
- What the annotation on each element of the sentence means:
  - (start of sentence): new sentence
  - Explosives: plural noun
  - found: past tense verb
  - on: preposition
  - Hampstead Heath: proper noun
  - . : punctuation

Important to note
- This is not “raw” text.
  - Annotation means we can search for particular patterns.
  - E.g. for the quiver/quake study: “find all occurrences of quiver which are verbs, followed by a determiner and a noun”
- The collection is very large.
  - Only in very large collections are we likely to find rare occurrences.
- Corpus search is done by computer.
  - You can’t trawl through 100 million words manually!
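The pattern search described above ("quiver as a verb, followed by a determiner and a noun") can be sketched over a tiny hand-tagged sample. Everything here is an assumption for illustration: the function name `find_verb_det_noun`, the (word, lemma, tag) token format, and the simplified tags VERB, DET, NOUN, PREP and PRON, which are not the BNC's own tagset. Real corpus tools offer their own query languages for this.

```python
def find_verb_det_noun(sentences, lemma):
    """Return every 'verb determiner noun' sequence whose verb has the given lemma."""
    hits = []
    for sent in sentences:
        # Slide a three-token window over the sentence.
        for i in range(len(sent) - 2):
            (w1, l1, t1), (w2, _, t2), (w3, _, t3) = sent[i : i + 3]
            if l1 == lemma and t1 == "VERB" and t2 == "DET" and t3 == "NOUN":
                hits.append(f"{w1} {w2} {w3}")
    return hits


# Hand-tagged toy sentences; tokens are (word, lemma, simplified tag).
corpus = [
    [("the", "the", "DET"), ("insect", "insect", "NOUN"),
     ("quivered", "quiver", "VERB"), ("its", "its", "DET"),
     ("wings", "wing", "NOUN")],
    [("she", "she", "PRON"), ("quivered", "quiver", "VERB"),
     ("with", "with", "PREP"), ("fear", "fear", "NOUN")],
    [("a", "a", "DET"), ("quiver", "quiver", "NOUN"),
     ("of", "of", "PREP"), ("arrows", "arrow", "NOUN")],
]

hits = find_verb_det_noun(corpus, "quiver")
print(hits)  # ['quivered its wings']
```

The third sentence shows why the annotation matters: without part-of-speech tags, a plain string search for "quiver" would also return the noun.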

The practical objections…
- But we’re linguists, not computer scientists! Do I have to write programs?
  - No, there are literally dozens of available tools to search in a corpus.
- Are all corpora good for all purposes?
  - No. Some are “general-purpose”, like the BNC. Others are designed to address specific issues.

The theoretical objections…
- What guarantee do we have that the texts in our corpus are “good data”, quality texts, written by people we can trust?
- How do I know that what I find isn’t just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
- Just because there are a few examples of something doesn’t mean that all native speakers use a certain construction!
- Do we throw intuition out of the window?

Part 2 A brief history of corpus linguistics

Language and the cognitive revolution
- Before the 1950s, the linguist’s task was:
  - to collect data about a language;
  - to make generalisations from the data (e.g. “In Maltese, the verb always agrees in number and gender with the subject NP”).
  - The basic idea: language is “out there”, the sum total of things people say and write.
- After the 1950s: the so-called “cognitive revolution”
  - language treated as a mental phenomenon
  - no longer about collecting data, but explaining what mental capabilities speakers have

The 19th & early 20th Century
- Many early studies relied on corpora.
- Language acquisition research was based on collections of child data.
- Anthropologists collected samples of unknown languages.
- Comparative linguists used large samples from different languages.
- A lot of work was done on frequencies:
  - frequency of words…
  - frequency of grammatical patterns…
  - frequency of different spellings…
- All of this was interrupted around 1955.

Chomsky and the cognitive turn
- Chomsky (1957) was primarily responsible for the new, cognitive view of language.
- He distinguished (1965):
  - Descriptive adequacy: describing language, making generalisations such as “X occurs more often than Y”
  - Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers’ competence, their mental grammar
- He made several criticisms of corpus-based approaches.

Criticisms of corpora (I)
- Competence vs. performance: To explain language, we need to focus on the competence of an idealised speaker-hearer.
- Competence = internalised, tacit knowledge of language
- Performance (the language we speak/write) is not a good mirror of our knowledge:
  - it depends on situations
  - it can be degraded
  - it can be influenced by other cognitive factors beyond linguistic knowledge

Criticisms of corpora (II)
- Early work using corpora assumed that the number of sentences of a language is finite (so we can get to know everything about language if the sample is large enough).
- But actually, it is impossible to count the number of sentences in a language. Syntactic rules make the possibilities literally infinite:
  - the man in the house (NP -> NP + PP)
  - the man in the house on the beach (PP -> PREP + NP)
  - the man in the house on the beach by the lake
  - …
- So what use is a corpus? We’re never going to have an infinite corpus.
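A small sketch of why the two rules above give an unbounded set of phrases: NP -> NP + PP re-introduces an NP, so the rule can apply to its own output any number of times. The function name `noun_phrase` and the three-item `PPS` list are assumptions made for illustration; the list is simply reused in a cycle so that every depth yields a longer phrase.

```python
# (preposition, noun) pairs used to build PPs via PP -> PREP + NP.
PPS = [("in", "house"), ("on", "beach"), ("by", "lake")]


def noun_phrase(depth):
    """Apply NP -> NP + PP `depth` times, starting from the bare NP 'the man'."""
    if depth == 0:
        return "the man"
    prep, noun = PPS[(depth - 1) % len(PPS)]
    # The recursive call is the NP on the right-hand side of NP -> NP + PP.
    return f"{noun_phrase(depth - 1)} {prep} the {noun}"


for d in range(4):
    print(noun_phrase(d))
```

Each extra application of the rule yields a new, longer phrase, so there is no largest noun phrase and hence no finite list of all of them.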

Criticisms of corpora (III)
- A corpus is always skewed, i.e. biased in favour of certain things.
  - Certain obvious things are simply never said. E.g. we probably won’t find “a dog is a dog” in our corpus.
- A corpus is always partial:
  - We will only find things in a corpus if they are frequent enough.
  - A corpus is necessarily only a sample.
  - Rare things are likely to be omitted from a sample.
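The point that rare things are likely to be missing from a sample can be made concrete with a rough calculation. This is a sketch under a strong simplifying assumption, namely that every word position is an independent chance for the item to occur, which real text does not satisfy. Under that assumption, an item occurring at rate p per word is seen at least once in N words with probability 1 - (1 - p)^N. The function name and the example rate are hypothetical.

```python
def chance_of_seeing(rate_per_million, corpus_size):
    """Probability of at least one occurrence, assuming independent word positions."""
    p = rate_per_million / 1_000_000
    return 1 - (1 - p) ** corpus_size


# A hypothetical item that occurs once per 100 million words on average.
for size in (1_000_000, 50_000_000, 100_000_000):
    print(f"{size:>11,} words: {chance_of_seeing(0.01, size):.3f}")
```

Even with a corpus as large as the item's average spacing, the chance of seeing it at all is well short of certainty, which is why absence from a corpus is weak evidence.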

Criticisms of corpora (IV)
- Why use a corpus if we already know things by introspection?
- How can a corpus tell us what is ungrammatical?
  - Corpora won’t contain “disallowed” structures, because these are by definition not part of the language.
  - So a corpus contains exclusively positive evidence: you only get the “allowed” things.
  - But if X is not in the corpus, this doesn’t mean it’s not allowed. It might just be rare, and your corpus isn’t big enough. (Skewness)

Refutations
- Corpora can be better than introspective evidence because:
  - They are public; other people can verify and replicate your results (the essence of scientific method).
  - Some kinds of data are simply not available to introspection. E.g. people aren’t good at estimating the frequency of words or structures.
  - Skewness can itself be informative: If X occurs more frequently than Y in a corpus, that in itself is an interesting fact.
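Frequency is a good example of information that is easy to compute but hard to introspect. A minimal sketch of a word frequency count follows; the function name `frequencies`, the crude regular-expression tokeniser, and the sample sentence are all assumptions made for illustration, and a real study would use a proper tokeniser and a much larger text.

```python
import re
from collections import Counter


def frequencies(text):
    """Return (word counts, total token count) for a lower-cased text."""
    # Crude tokeniser: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens), len(tokens)


sample = "The cat sat on the mat. The dog sat too."
counts, total = frequencies(sample)

for word, n in counts.most_common(3):
    print(f"{word}: {n} of {total} tokens ({n / total:.0%})")
```

Reporting counts relative to the total number of tokens is what makes "X occurs more often than Y" comparable across corpora of different sizes.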

Refutations (II)
- By the way, nobody’s saying “throw introspection out the window”…
  - There is no reason not to combine the corpus-based and the introspection-based method.
- Many other objections can be overcome by using large enough corpora.
  - Pre-1950, most corpus work was done manually, so it was error-prone.
  - Machine-readable corpora mean we have a great new tool to analyse language very efficiently!

Corpora in the late 20th Century
- Corpus linguistics enjoyed a revival with the advent of the digital computer.
  - Kucera and Francis: the Brown Corpus, one of the first
  - Svartvik: the London-Lund Corpus, which built on Brown
- These were rapidly followed by others…
- Today, corpora are firmly back on the linguistic landscape.

Summary
- Introduced the notion of corpus and corpus-based research
- Gave a quick overview of the history of this methodology
- Looked at some possible objections to corpus-based methods, and some possible counter-arguments

Next lecture
- We look more closely at some important properties of a corpus:
  - Machine-readability
  - Balance
  - Representativeness
  - …