What’s in a Corpus?
School of Computing, FACULTY OF ENGINEERING
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)

Reminder
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
BACKGROUND READING: (Atwell 99) The Language Machine
Intro to NLTK: visit the website http://www.nltk.org

Today
The main areas of linguistics
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, v inflections (be v am/is/was…)

The main sub-areas of linguistics
◮ Phonetics and Phonology: the study of linguistic sounds and speech.
◮ Morphology: the study of the meaningful components of words.
◮ Syntax (grammar): the study of the order of and links between words.
◮ Semantics: the study of the meanings of words, phrases and sentences.
◮ Discourse: the study of linguistic units larger than a single utterance.
◮ Pragmatics: the study of how language is used to accomplish goals.

Why is NLP hard?
Main reason: ambiguity in all areas and on all levels, e.g.:
◮ Phonetic ambiguity: one expression being pronounced in several ways
◮ POS ambiguity: one word having several different parts of speech (adjective/noun, …)
◮ Lexical ambiguity: one word having several different meanings
◮ Syntactic/structural ambiguity: one phrase or sentence having several different possible structures
◮ Pragmatic ambiguity: one sentence communicating several different intentions
◮ Referential ambiguity: one expression having several different possible referents
Key task in NLP: disambiguation in context!
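
A minimal sketch (not from the original slides) of lexical ambiguity, using NLTK’s WordNet interface; it assumes the nltk package and its 'wordnet' data package are installed.

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet', quiet=True)  # fetch the WordNet data if missing

# One word type, many meanings: each synset is a distinct sense of "bank"
# (river bank, financial institution, ...); disambiguation means picking
# the right one in context.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())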

Rationalism v Empiricism
Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)
Noam Chomsky (1957, Syntactic Structures) argued that we should build models through introspection: a language model is a set of rules thought up by an expert, like “Expert Systems”…
Chomsky thought data was full of errors; better to rely on linguists’ intuitions…

Empiricism v Rationalism
Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)
The field was stuck for quite some time: rationalist linguistic models built for a specific example did not generalise.
A new approach started around 1990: Corpus Linguistics. Well, not really new, but in the 1950s to 1980s researchers didn’t have the text, disk space, or GHz.
Main idea: machine learning from CORPUS data
How to do corpus linguistics:
Get a large text collection (a corpus; plural: corpora)
Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective
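
A minimal sketch of this recipe (not from the slides), assuming NLTK and its Brown corpus sample are installed: take a corpus, then compute a simple statistical model over it – here a bigram model of which word tends to follow which.

import nltk
from nltk.corpus import brown

nltk.download('brown', quiet=True)

# Step 1: get a large text collection (the 1M-word Brown corpus).
# Step 2: compute a statistical model over its words.
bigrams = nltk.bigrams(w.lower() for w in brown.words())
model = nltk.ConditionalFreqDist(bigrams)

# The five most likely words to follow "the" in this corpus.
print(model['the'].most_common(5))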

What is a corpus?
A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, e.g.:
◮ Language and type: English/German/Arabic/…, dialects v. “standard”, edited text v. spontaneous speech, …
◮ Genre and Domain: 18th-century novels, newspaper text, software manuals, train-enquiry dialogue, …
◮ Web as Corpus: URL “domain” = country: .uk, .ar, …
◮ Media: “written” text, audio, transcriptions, video
◮ Size: 1,000 words, 50K words, 1M words, 100M words, ???

Brown and LOB
◮ Brown: famous first corpus! (well, first widely-used corpus)
◮ Built by Nelson Francis and Henry Kucera, Brown University, USA
◮ A balanced corpus: representative of a whole language
◮ Brown: balanced corpus of written, published American English from the 1960s (newspapers, books, … NOT handwritten)
◮ 1 million words, Part-of-Speech tagged
◮ LOB: Lancaster-Oslo/Bergen corpus, the British English counterpart: published British English text from equivalent 1960s sources
◮ FROWN, FLOB: US and UK text from equivalent 1990s sources
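
NLTK ships a copy of the Brown corpus, so you can explore it directly; a minimal sketch (not from the slides), assuming nltk and its 'brown' data package are installed (LOB, FLOB and FROWN are not distributed with NLTK):

import nltk
from nltk.corpus import brown

nltk.download('brown', quiet=True)

print(brown.categories()[:5])   # the genres that make it a "balanced" corpus
print(len(brown.words()))       # just over a million word tokens
print(brown.sents()[0])         # the first sentence of the first text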

Some recent corpora
Corpus features: size, domain, language
British National Corpus: 100M words, balanced British English
Newswire Corpus: 600M words, newswire, American English
UN or EU proceedings: 20M+ words, legal, 10 language pairs
Penn Treebank: 2M words, newswire American English
MapTask: 128 dialogues, British English
Corpus of Contemporary Arabic: 1M words, balanced Arabic
Web: 8 billion(?) words, many domains and languages
Web-as-Corpus: harvest your own corpus from the WWW, via “seed terms” → Google API → web-pages → corpus!
Marco Baroni: BootCaT; Adam Kilgarriff: SketchEngine, …
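
A minimal web-as-corpus sketch (not from the slides), using only the Python standard library. Real pipelines such as BootCaT query a search-engine API with seed terms to find candidate pages; here the URL list is a hypothetical placeholder, and cleaning is limited to stripping markup.

from urllib.request import urlopen
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style elements."""
    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self.skip = False
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def harvest(urls):
    """Fetch each page and keep its visible text as one corpus document."""
    corpus = []
    for url in urls:
        html = urlopen(url).read().decode('utf-8', errors='ignore')
        parser = TextExtractor()
        parser.feed(html)
        corpus.append(' '.join(parser.chunks))
    return corpus

# seed_urls = ['https://example.org/page1']   # hypothetical page list
# documents = harvest(seed_urls)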

Corpus Annotation
Annotation is a process in which linguistics experts add (linguistic) information to the corpus that is not explicitly there; this increases the utility of a corpus. E.g.:
◮ Text headers: meta-data for each text: author, date, type, …
◮ Part-of-speech tag for each word (very common!)
◮ Syntactic structure: parse-tree for each sentence
◮ Word-sense label for each word
◮ Prosodic information: pauses, rise and fall in pitch, etc.

Annotation example: POS tagging
◮ Some texts are annotated with Part-of-Speech (POS) tags.
◮ POS tags encode simple grammatical functions.
<s> <w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w> <w pos=NN> sentence </w> . </s>
◮ Several tag sets:
◮ Brown tag set (87 tags) in the Brown corpus
◮ CLAWS / LOB tag set (132 tags) in the LOB corpus
◮ Penn tag set (45 tags) in the Penn Treebank
◮ CLAWS C5 tag set (62 tags) in the BNC (British National Corpus)
◮ Tagging is usually done automatically (then proofread and corrected)
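
A minimal sketch (not from the slides) of automatic tagging and of reading existing annotation with NLTK; it assumes the 'punkt', 'averaged_perceptron_tagger' and 'brown' data packages are installed. Note that nltk.pos_tag uses the Penn tag set, while the Brown corpus carries Brown-tag-set annotation.

import nltk
from nltk.corpus import brown

for pkg in ('punkt', 'averaged_perceptron_tagger', 'brown'):
    nltk.download(pkg, quiet=True)

# Automatic tagging (in real corpora, then proofread and corrected).
tokens = nltk.word_tokenize("Here is a sentence.")
print(nltk.pos_tag(tokens))

# The Brown corpus's own hand-corrected word/tag pairs.
print(brown.tagged_words()[:4])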

http://www.comp.leeds.ac.uk/eric/atwell00icamej.pdf

http://www.comp.leeds.ac.uk/eric/atwell08clih.pdf

What’s a word?
How many words do you find in the following short text? What is the biggest/smallest plausible answer to this question? What problems do you encounter?
“It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.”
Time: 1 minute

Counting words: tokenisation
Tokenisation is a processing step in which the input text is automatically divided into units called tokens, where each token is a word, a number or a punctuation mark…
So, a word count can ignore numbers and punctuation marks (?)
Word: continuous alphanumeric characters delimited by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
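
A minimal sketch (not from the slides) contrasting naive whitespace splitting with NLTK’s tokeniser on the exercise text above; it assumes nltk and its 'punkt' models are installed.

import nltk

nltk.download('punkt', quiet=True)

text = "It's a shame that our data-base is not up-to-date."

# Naive: punctuation sticks to words, and "It's" stays one token.
print(text.split())

# NLTK: clitics and punctuation are split off ("It", "'s", ".."),
# while hyphenated words like "data-base" stay as single tokens.
print(nltk.word_tokenize(text))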

Counting words: types v tokens
◮ Word token: an individual occurrence of a word
◮ Q: How big is the corpus (N)? = how many word tokens are there? (LOB: 1M; BNC: 100M)
◮ Word type: the “word itself”, regardless of context
◮ Q: How many “different words” (word types) are there? = the size of the corpus vocabulary (LOB: 50K; BNC: 650K)
◮ Q: What is the frequency of each word type? = the type-token distribution
A few word types (the, of, a, …) are very frequent, but most are rare, and about half of all word types occur only once! Zipf’s Law
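
A minimal sketch (not from the slides) of token/type counts and the type-token distribution, using NLTK’s copy of the Brown corpus; any other tokenised text works the same way.

import nltk
from nltk.corpus import brown

nltk.download('brown', quiet=True)

tokens = [w.lower() for w in brown.words()]
fdist = nltk.FreqDist(tokens)

print('word tokens (N):', len(tokens))          # corpus size
print('word types (V): ', len(fdist))           # vocabulary size
print('top types:      ', fdist.most_common(5)) # "the", "of", ... dominate
print('hapax legomena: ', len(fdist.hapaxes())) # types occurring only once (Zipf)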

Other sorts of “words”
◮ Lemma/lexeme: the dictionary form of a word. cost and costs are derived from the same lexeme “cost”; data-base, data base, database, databases share the same lexeme.
A lexeme can include spaces: data base, New York
Ambiguous tokenisation: as well (= also), as well as (= and)
Inflection: a grammatical variant, e.g. cost v costs
◮ Morpheme: the basic “atomic”, indivisible unit of meaning or grammar, e.g. data, base, -s
◮ For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, suffix, infix): morph+ologi+cal or morpho+logic+al?
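
A minimal sketch (not from the slides) of recovering lemmas with NLTK’s WordNet lemmatiser; it assumes the 'wordnet' data package is installed, and the lemmatiser needs to be told the part of speech.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('costs', pos='n'))      # inflection stripped -> 'cost'
print(lemmatizer.lemmatize('databases', pos='n'))  # -> 'database'
print(lemmatizer.lemmatize('was', pos='v'))        # irregular verb -> 'be'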

Arabic Morphology: Templatic Morphology
Root: ب ت ك (k t b)
Pattern: a template of vowels and affixes with slots for the root consonants, e.g. the vowels ū, ma, i, ā
Lexemes: مكتوب maktūb “written”, كاتب kātib “writer”
Psycholinguistic reality: loanwords are fitted to templates, e.g. format → فرمت farmat
Dictionaries are ordered by root; not all root + pattern combinations are possible
Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random

Arabic Morphology: Root Meaning + Pattern Meaning
Root ك ت ب (KTB) = the notion of “writing”:
كتب /katab/ write
كتاب /kitāb/ book
مكتوب /maktūb/ written
مكتوب /maktūb/ letter
مكتب /maktab/ office
مكتبة /maktaba/ library
كاتب /kātib/ writer
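
A minimal sketch (not from the slides) of the root-and-pattern idea in Python, using transliterated consonants only; the patterns and glosses follow the examples above, and a real Arabic morphological analyser is of course far more involved.

def apply_pattern(root, pattern):
    """Replace each 'C' slot in the pattern with the next root consonant."""
    consonants = iter(root)
    return ''.join(next(consonants) if ch == 'C' else ch for ch in pattern)

root = ['k', 't', 'b']                     # the "writing" root
patterns = {
    'CaCaC':  'write (katab)',
    'CiCāC':  'book (kitāb)',
    'maCCūC': 'written (maktūb)',
    'CāCiC':  'writer (kātib)',
}
for pattern, gloss in patterns.items():
    print(apply_pattern(root, pattern), '-', gloss)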

Reminder
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Morpheme: basic lexical unit, “root form”, plus affixes
Lexeme: dictionary entry, can be multi-word: New York