BTANT 129 w5 Introduction to corpus linguistics

Presentation transcript:

BTANT 129 w5 Introduction to corpus linguistics

Corpus
– The old-school concept: a collection of texts, especially if complete and self-contained: "the corpus of Anglo-Saxon verse" (The Oxford Companion to the English Language)
– The modern view: a collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP)

Corpus vs. archive
– Text archive: a collection of texts in their original format (example: Oxford Text Archive)
– Corpus: texts collected and processed in a unified, systematic manner (example: British National Corpus)

Short history
Brief mention of just a select few!
– Brown Corpus (Brown University)
   – 1 m words
   – 15 genres
   – 500 samples of 2000 words each
   – Area: US
   – Time: 1961
– LOB Corpus (Lancaster-Oslo/Bergen)
   – GB replica of Brown

Cobuild
– Major corpus initiative by Collins and Birmingham University
– John Sinclair
– … m words -> Bank of English, currently 450 m words

British National Corpus
– 100 m words
– careful selection
– 10% spoken material
– time span: 1960 (fiction), 1975 (non-fiction)
– … word texts
– TEI-compliant SGML coding

International Corpus of English
– 20 corpora of 1 m words each, devoted to varieties of English around the world
– 500 texts (300 spoken, 200 written) of 2000 words each
– time span: …
– ICE-GB available in a demo version
– syntactic annotation, graphical tool ICECUP

Corpus processing: tokenization
Preprocessing
– Tokenization: segmenting the text into
   – sentences (sometimes tricky: sentence delimiters in mid-sentence positions)
   – words (multi-word units are a problem)
– Normalization: restoring clitics, abbreviations ("can't", "I've")
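
The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the pipeline of any particular corpus: the abbreviation list, the clitic table and the function names are all assumptions made for the example, and multi-word units are not handled at all.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Hypothetical clitic table used by the normalization step.
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Split after . ! ? unless the delimiter closes a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Separate words (keeping internal apostrophes) from punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(tokens):
    """Expand the clitic forms listed in CLITICS; leave other tokens alone."""
    out = []
    for tok in tokens:
        out.extend(CLITICS.get(tok.lower(), [tok]))
    return out


for sent in split_sentences("Dr. Smith can't come. I've told him!"):
    print(normalize(tokenize(sent)))
```

The full stop in "Dr." is the kind of mid-sentence delimiter the slide warns about: without the abbreviation check the first sentence would be cut after the title.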

Corpus processing: tagging
– Tagging: labelling every word with its part-of-speech (POS) category
– Problem: ambiguity. Out of context, a word can belong to different parts of speech or have different analyses within the same POS
   – set N vs. set V
   – Hungarian bánt: 'bánik' VBD or 'bánt' VBZ
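
The ambiguity problem can be pictured as a lexicon lookup that returns more than one analysis for a word form. The sketch below is a toy: the lexicon holds only the slide's own examples plus one unambiguous word, and the function names and the "UNKNOWN" label are assumptions made for the illustration.

```python
# Toy lexicon: each word form maps to the set of analyses it can receive
# when seen out of context. Entries follow the examples on the slide.
LEXICON = {
    "set": {"N", "V"},          # different parts of speech
    "bánt": {"VBD", "VBZ"},     # different analyses within the same POS
    "the": {"DET"},             # unambiguous
}


def candidate_tags(word):
    """Return every analysis the lexicon allows for this word form."""
    return sorted(LEXICON.get(word.lower(), {"UNKNOWN"}))


def is_ambiguous(word):
    """A word is ambiguous if the lexicon lists more than one analysis."""
    return len(LEXICON.get(word.lower(), ())) > 1


for w in ["the", "set", "bánt"]:
    print(w, candidate_tags(w), is_ambiguous(w))
```

A tagger that stops here has only listed the options; choosing among them in context is the disambiguation step.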

Corpus processing: disambiguation
– Disambiguation: defining the correct analysis in context
– Two approaches; both need a manually corrected training corpus
   – statistical
      – Hidden Markov model
      – calculates probability within a span of usually one or two words
      – rate of success can be around 98%
   – rule-based
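
As a rough sketch of the statistical approach, the code below trains a bigram Hidden Markov model (a span of one preceding tag) on a tiny hand-tagged corpus and decodes with the Viterbi algorithm. Everything here is an assumption made for illustration: the three training sentences, the tag names, the add-one smoothing and the function names. Real taggers are trained on corpora that are orders of magnitude larger.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged training corpus, invented for this example.
TRAIN = [
    [("the", "DET"), ("set", "N"), ("is", "V"), ("complete", "ADJ")],
    [("they", "PRON"), ("set", "V"), ("the", "DET"), ("table", "N")],
    [("we", "PRON"), ("set", "V"), ("the", "DET"), ("rules", "N")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # previous tag -> counts of next tag
    emit = defaultdict(Counter)   # tag -> counts of word
    for sent in corpus:
        prev = "<s>"              # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words`."""
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    n_words = len(vocab) + 1      # one extra slot for unseen words
    # best[i][t] = (log score of best path ending in tag t, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)) + e, p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train(TRAIN)
print(viterbi(["the", "set"], trans, emit))
print(viterbi(["they", "set", "the", "table"], trans, emit))
```

With this toy data "set" comes out as N after "the" and as V after "they", which is the context effect the slide describes.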

Syntactic annotation
– Difficult to do on such a scale
– Shallow parsing
– Treebank: a collection of syntactically analyzed sentences
– Penn Treebank
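
Shallow parsing stops short of a full tree and only groups tagged words into flat chunks. A minimal sketch, assuming a simplified tag set (DET, ADJ, N, V) and a single hand-written pattern for noun phrases; the pattern and the function name are illustrative, not taken from any treebank.

```python
def chunk_noun_phrases(tagged):
    """Group maximal 'optional DET, any ADJ, one or more N' runs into NP chunks.

    `tagged` is a list of (word, tag) pairs; tokens outside a noun phrase
    are returned as single-token chunks labelled "O".
    """
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1                         # any number of adjectives
        k = j
        while k < len(tagged) and tagged[k][1] == "N":
            k += 1                         # one or more nouns
        if k > j:                          # found at least one noun
            chunks.append(("NP", [w for w, _ in tagged[i:k]]))
            i = k
        else:                              # no noun: token stays outside
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


sentence = [("the", "DET"), ("old", "ADJ"), ("corpus", "N"),
            ("contains", "V"), ("texts", "N")]
print(chunk_noun_phrases(sentence))
```

The output is a flat sequence of chunks with no nesting, which is what makes this kind of analysis feasible on corpora too large to parse fully.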

Recent trends
– Word sense disambiguation (SENSEVAL)
– Message understanding: …jects/muc/index.html
– Semantic Web
   – making information on the web understandable for machines
   – a vision requiring a huge effort; it is not clear whether it is feasible at all

Representative sample?
A corpus of any size is inevitably a sample. Of what?
Two approaches
– sampling speakers: demographic sampling
– sampling their output: text type sample

The notion of representativeness
Sample vs. population: the sample should be proportional to the population for a given feature
– Example of demographic sampling: if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male -> our sample is representative of Budapest residents for gender
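
The Budapest example amounts to proportional (stratified) sampling: each group gets a quota in the sample equal to its share of the population. A minimal sketch follows; the population of 1000 people, the field names and the function name are made up for the illustration, and only the 48% figure comes from the slide.

```python
import random


def proportional_sample(population, key, sample_size, seed=0):
    """Draw a sample in which each value of `key` keeps its population share."""
    rng = random.Random(seed)              # fixed seed for repeatable draws
    strata = {}
    for person in population:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        # Quota for this stratum, rounded to a whole number of informants.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 48% male, 52% female.
population = ([{"id": i, "gender": "male"} for i in range(480)]
              + [{"id": i, "gender": "female"} for i in range(480, 1000)])

sample = proportional_sample(population, "gender", 100)
males = sum(1 for p in sample if p["gender"] == "male")
print(len(sample), males)
```

Because quotas are rounded, the total can differ slightly from the requested size when the shares do not divide evenly. The sample is also representative only for the feature used to build the strata, here gender.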

Trouble with representativeness
– What should be the units of sampling? Registers, text types, genres etc.
– But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement

Approaches to representativeness
Douglas Biber:
– Rejects the notion of proportional sampling
– The sample should be as varied as possible
– Representativeness is measured in terms of the wide variety of text types included in the sample

The Web as a corpus?
Pros:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
Cons:
– lots of rubbish, irrelevant data
– difficult to extract hits
– no language analysis
– only string query, which is crude

One quick example
– Representativity or representativeness?
– Throw the two words at Google and have a look at the figures
– Think about the conclusions
– There are special front-end sites