UoL Summer Institute in Corpus Linguistics Matthew Brook O’Donnell


Corpus Mark-up

Aims
- Introduce the concepts of corpus mark-up and annotation
- Consider why we would want to add extra non-textual information to corpus texts
- Use a POS tagger and tagged text

What is Corpus Annotation?
‘the practice of adding interpretative linguistic information to a corpus’ (Leech 2005)
Adding interpretative linguistic information results in a value-added corpus.

Terminology
Corpus markup: processing/formatting information; metadata/text classifications; structural representation
Tagging: (usually) inline addition of a category to word(s)
Parsing: higher-level, multiword units (constituents); chunking/shallow vs. full syntactic parsing; needn’t just be syntactic analysis
XML: eXtensible Markup Language
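As a minimal sketch of how these terms fit together, the snippet below stores metadata in a header element and attaches a category to each word as an XML attribute. The element and attribute names (`text`, `header`, `title`, `s`, `w`, `pos`) and the title value are illustrative assumptions, not a particular corpus encoding scheme; the tags are Penn Treebank-style categories.

```python
import xml.etree.ElementTree as ET

# (word, tag) pairs; the tags are Penn Treebank-style categories.
tagged = [("Corpus", "NN"), ("annotation", "NN"), ("is", "VBZ"),
          ("the", "DT"), ("practice", "NN")]

def build_text(tagged_tokens, title):
    """Build a small XML document: metadata in <header>, tagged words in <s>.

    All element and attribute names here are hypothetical.
    """
    text = ET.Element("text")
    header = ET.SubElement(text, "header")
    ET.SubElement(header, "title").text = title
    sentence = ET.SubElement(text, "s", n="1")
    for word, tag in tagged_tokens:
        w = ET.SubElement(sentence, "w", pos=tag)
        w.text = word
    return text

doc = build_text(tagged, "Example sentence")
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)

# The annotation can be queried without altering the words themselves.
nouns = [w.text for w in doc.iter("w") if w.get("pos") == "NN"]
print(nouns)
```

Keeping the category in an attribute rather than gluing it to the word means the original text can still be recovered by reading only the element content.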

Why Annotate?
- Manual examination of corpus
- Automatic analysis of corpus
- Reusability of annotations
- Multi-functionality
- Objective record of analysis
- Annotation process is corpus analysis
(Leech 2005; McEnery 2003; O’Donnell 1999)

Types of Corpus Annotation
- Part-of-speech (POS)
- Lemmatization
- Syntactic (parsing)
- Semantic (domain classifications)
- Coreference (discourse)
- Pragmatic (speech acts, dialogue)
- Stylistic
- Research-specific (ad hoc)

POS Tagging: CLAWS C5
Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 to_PRP a_AT0 corpus_NN1 ._.
NN1: singular noun
AJ0: adjective (unmarked)
VBZ: -s form of the verb "BE"
PRF: the preposition OF
VVG: -ing form of lexical verb
AT0: article

POS Tagging: CLAWS C7
Corpus_NN1 annotation_NN1 is_VBZ the_AT practice_NN1 of_IO adding_VVG interpretative_JJ linguistic_JJ information_NN1 to_II a_AT1 corpus_NN1 ._.
http://www.comp.lancs.ac.uk/ucrel/claws/trial.html

POS Tagging: POSTagger
Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN adding/VBG interpretative/JJ linguistic/JJ information/NN to/TO a/DT corpus/NN ./.
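Inline-tagged text of this kind is straightforward to read back into word/tag pairs. The following is a minimal sketch, assuming whitespace-separated tokens and a single separator character per format; the function name `parse_tagged` is hypothetical.

```python
from collections import Counter

def parse_tagged(text, sep):
    """Split inline-tagged text into (word, tag) pairs.

    `sep` is the character joining word and tag: "_" for the CLAWS-style
    output, "/" for the slash-style output. rpartition splits on the LAST
    separator, so punctuation tokens such as "._." or "./." still yield
    a word and a tag.
    """
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            raise ValueError(f"no {sep!r} separator in token {token!r}")
        pairs.append((word, tag))
    return pairs

c5 = ("Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF "
      "adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 "
      "to_PRP a_AT0 corpus_NN1 ._.")
penn = ("Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN "
        "adding/VBG interpretative/JJ linguistic/JJ information/NN "
        "to/TO a/DT corpus/NN ./.")

c5_pairs = parse_tagged(c5, "_")
penn_pairs = parse_tagged(penn, "/")

# The words are identical; only the tagset differs.
words_c5 = [w for w, _ in c5_pairs]
words_penn = [w for w, _ in penn_pairs]

# Tag frequency lists, one per tagset.
c5_counts = Counter(tag for _, tag in c5_pairs)
penn_counts = Counter(tag for _, tag in penn_pairs)
print(c5_counts.most_common(3))
print(penn_counts.most_common(3))
```

Comparing the two frequency lists shows how the choice of tagset changes the granularity of what can be counted: C5 separates the preposition OF (PRF) from other prepositions (PRP), while the slash-tagged output uses IN and TO.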

Parsing: Chunking
[NP (NN Corpus) (NN annotation) ] (VBZ is) [NP (DT the) (NN practice) ] (IN of) (VBG adding) [NP (JJ interpretative) (JJ linguistic) (NN information) ] [PP (TO to) [NP (DT a) (NN corpus) ] ]
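A rough idea of how chunking can work over tagged text is sketched below. It uses a single hand-written rule (optional determiner, any adjectives, one or more nouns) and finds only NP chunks; it is an illustrative assumption, not the method used to produce the chunked output shown on the slide, and it does not build the PP chunk. The function names are hypothetical.

```python
def chunk_noun_phrases(tagged):
    """Group tokens into NP chunks with one rule: DT? JJ* NN+.

    Returns a list whose items are either a (word, tag) pair (outside any
    chunk) or ("NP", [pairs...]) for a chunk.
    """
    result = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":      # any adjectives
            j += 1
        noun_start = j
        while j < n and tagged[j][1].startswith("NN"):  # one or more nouns
            j += 1
        if j > noun_start:        # at least one noun: emit a chunk
            result.append(("NP", tagged[i:j]))
            i = j
        else:                     # no noun found: token stays outside
            result.append(tagged[i])
            i += 1
    return result

def render(chunks):
    """Render chunks in a bracketed layout similar to the slide."""
    parts = []
    for item in chunks:
        if isinstance(item[1], list):
            inner = " ".join(f"({tag} {word})" for word, tag in item[1])
            parts.append(f"[NP {inner} ]")
        else:
            word, tag = item
            parts.append(f"({tag} {word})")
    return " ".join(parts)

tagged = [("Corpus", "NN"), ("annotation", "NN"), ("is", "VBZ"),
          ("the", "DT"), ("practice", "NN"), ("of", "IN"),
          ("adding", "VBG"), ("interpretative", "JJ"),
          ("linguistic", "JJ"), ("information", "NN"), ("to", "TO"),
          ("a", "DT"), ("corpus", "NN"), (".", ".")]

chunks = chunk_noun_phrases(tagged)
np_chunks = [[w for w, _ in item[1]] for item in chunks
             if isinstance(item[1], list)]
print(render(chunks))
```

Because the rule looks only at tags, the quality of the chunks depends directly on the quality of the POS tagging underneath.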

Parsing
(S (NP Corpus annotation) (VP is (NP (NP the practice) (PP of (S (VP adding (NP interpretative linguistic information) (PP to (NP a corpus))))))) .)
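Bracketed parses like this can be read into a nested structure and then searched by constituent label. The sketch below is a small recursive reader written for illustration, using a fully balanced bracketing of the example sentence; the function names are hypothetical and no parsing library is assumed.

```python
import re

def read_tree(text):
    """Parse a bracketed string into nested lists: [label, child, ...].

    Leaves are plain strings. Raises ValueError on unbalanced brackets.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("expected a label after '('")
        node = [tokens[pos]]
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return node

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree

def leaves(node):
    """Return the words of a tree in order."""
    words = []
    for child in node[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

def constituents(node, label):
    """Yield the word string of every constituent with the given label."""
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1:]:
        if isinstance(child, list):
            yield from constituents(child, label)

parse_string = ("(S (NP Corpus annotation) (VP is (NP (NP the practice) "
                "(PP of (S (VP adding (NP interpretative linguistic "
                "information) (PP to (NP a corpus))))))) .)")
tree = read_tree(parse_string)
noun_phrases = list(constituents(tree, "NP"))
prep_phrases = list(constituents(tree, "PP"))
print(noun_phrases)
print(prep_phrases)
```

Unlike the flat chunks, the full parse nests constituents, so the same words can be retrieved as part of several phrases at different levels.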

Semantic Annotation
Each word is given a code from a thesaurus-style dictionary. Also called word sense tagging.
Examples:
- UCREL Semantic Analysis System [http://www.comp.lancs.ac.uk/ucrel/usas/]
- WordNet [http://wordnet.princeton.edu/]

Semantic Annotation
The noun move has 5 senses (first 5 from tagged texts)
1. (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
2. (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
3. (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
4. (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
5. (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)

Semantic Annotation
The verb move has 16 senses (first 13 from tagged texts)
1. (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
2. (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
3. (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
4. (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")

Tools
- XML annotation editors
- GATE
- WordSmith

The ‘Great Annotation Debate’
Leech et al.: ‘annotation = value added’
Sinclair: ‘annotation = perilous activity’
Scott: ‘beware of the POS prison!’

Sinclair on the perils of corpus annotation ‘The interspersing of tags in a language text is a perilous activity, because the text thereby loses integrity…’ ‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)

Sinclair on the perils of corpus annotation ‘…one cosy consequence of using tagged text is that the description which produces the tags in the first place is not challenged – it is protected. The corpus data can only be observed through the tags; that is to say, anything the tags are not sensitive to will be missed’ ‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)

Sinclair on the perils of corpus annotation ‘In corpus-driven linguistics you do not use pre-tagged text, but you process the raw text directly and then patterns of this uncontaminated text are able to be observed.’ ‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)

Hunston – annotation as ‘double-edged sword’ ‘…the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit, not the kind of question that can be asked, but the kind of question that usually is asked.’ (Hunston 2002: 93)

Hunston – annotation as ‘double-edged sword’ ‘Most of the work that is done using annotated corpora uses categories that have been developed in pre-corpus days, such as nominal clauses, anaphoric reference… Phenomena such as frames or semantic prosody… tend to have been identified from plain text corpora and word-based studies.’ (Hunston 2002: 93)

Corpus-based approach
[Diagram labels: CORPUS, METHODS, DATA, RESULTS; plain corpus; ANALYSIS: categorization; Annotate corpus: POS, parsing, semantic, reference; annotated corpus; ANALYSIS: generalization]

Corpus-driven approach
[Diagram labels: CORPUS, METHODS, DATA, RESULTS; plain corpus; ANALYSIS: generalization & categorization]

Problem for both CB & CD Approach
Serial/sequential process
- CB: analysis before (annotation) and after processing
- CD: analysis only after processing (so no need for annotation)
Empirical process is cyclic
- analysis feeds back into the process and around again… and again…

So what if…
Hunston: ‘Most of the work that is done using annotated corpora uses categories that have been developed in pre-corpus days…’ (Hunston 2002: 93)
…we annotate categories that have come out of corpus analysis instead of/as well as traditional categories?

New uses for corpus annotation
Cyclic investigation process
1. KWIC/Frequency list/Collocates etc.
2. Annotate results
3. Go to 1
How should we annotate:
- collocates
- lexical items
- semantic associations/prosodies
- local textual functions
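One pass through the cycle above can be sketched as follows: produce KWIC lines and a collocate list from plain text, then write an annotation derived from those results back into the text so the next pass can search on it. The toy corpus reuses three WordNet example sentences for the noun move quoted earlier; the function names and the label `PAT1` are hypothetical placeholders, not an established annotation scheme.

```python
from collections import Counter

def kwic(tokens, node, width=3):
    """Return (left, node, right) context triples for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w.lower() for w in window)
    return counts

def annotate(tokens, node, collocate, label):
    """Attach `label` to hits of `node` immediately followed by `collocate`."""
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if tok.lower() == node and nxt == collocate:
            out.append(f"{tok}_{label}")
        else:
            out.append(tok)
    return out

sentences = ["he didn't make a move to help",
             "his first move was to hire a lawyer",
             "an impatient move of his hand"]

# Step 1: KWIC lines and collocates from the plain text.
total = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for left, hit, right in kwic(tokens, "move"):
        print(f"{left:>25}  {hit}  {right}")
    total.update(collocates(tokens, "move"))
print(total.most_common(2))

# Step 2: annotate a pattern that came out of step 1 (move + following "to").
annotated = [annotate(s.split(), "move", "to", "PAT1") for s in sentences]

# Step 3: go to 1, now searching the annotated text.
for tokens in annotated:
    print(" ".join(tokens))
```

The category written back in step 2 comes out of the corpus analysis itself rather than from a pre-existing tagset, which is the point of the cyclic process.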

References
Leech, G. 2005 ‘Adding Linguistic Annotation’, in M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice (Oxford: Oxbow Books), pp. 17-29 [http://ahds.ac.uk/linguistic-corpora/]
Hunston, S. 2002 Corpora in Applied Linguistics (Cambridge: Cambridge University Press)
McEnery, A. 2003 ‘Corpus Linguistics’, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics (Oxford: Oxford University Press), pp. 448-463

O’Donnell, M.B. 1999 ‘The Use of Annotated Corpora for New Testament Discourse Analysis: A Survey of Current Practice and Future Prospects’, in S.E. Porter and J.T. Reed (eds.), Discourse Analysis and the New Testament: Results and Applications (Sheffield: Sheffield Academic Press), pp. 71-117
Sinclair, J. 2004 Trust the Text: Language, Corpus and Discourse (London: Routledge)