Corpus Linguistics Richard Xiao


Corpus annotation
Corpus Linguistics, Richard Xiao, lancsxiaoz@googlemail.com

Outline of the session
Lecture
  Rationale for corpus annotation
  Leech’s maxims of corpus annotation
  Types of annotation
Lab
  CLAWS POS tagger (online and Windows-based)
  Introducing Wmatrix
  ICTCLAS

Corpora and annotation
Unannotated corpus
  Simple plain text or raw text
  The linguistic information is implicit, e.g. no explicit representation of "present" as a noun
Annotated corpus
  No longer just text
  A real repository of linguistic information
  The relevant linguistic information is now explicit (e.g. "present" as a noun, adjective, or verb)

Corpus annotation
What is annotation?
  “The process of adding […] interpretive, linguistic information to an electronic corpus of spoken and/or written language data” (Leech 1997)
  Broadly, the term also refers to the results of the annotation process
In a strict sense, annotation is different from corpus markup
  Markup provides objective, verifiable information, e.g. author, paragraph boundary
  Annotation is concerned with interpretive linguistic information, e.g. part of speech

Why annotate a corpus?
  It makes information retrieval and extraction easier and faster, and enables human analysts to exploit and retrieve analyses of which they are not themselves capable
  Annotated corpora are reusable resources
  Annotated corpora are multifunctional: they can be annotated for one purpose and reused for another
  Corpus annotation records a linguistic analysis explicitly
  Corpus annotation provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted on a common basis

How are corpora annotated?
Automatic annotation
  Can be automated reliably for some types (POS tagging, lemmatization)
  Can annotate large amounts of data quickly at low cost
  Post-editing or human correction may be necessary to improve accuracy
Computer-assisted annotation
  The semi-automatic annotation process (human-machine interface) may produce more reliable results than fully automated annotation, but it is also slower and more costly
Manual annotation
  Used where no annotation tool is available or where the accuracy of available systems is not high enough to be useful
  Expensive and time-consuming, typically only feasible for small corpora

Leech’s 7 maxims of annotation
1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text.
3. The annotation scheme should be based on guidelines which are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.
5. The end user should be made aware that the corpus annotation is not error-free or infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
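
Maxims 1 and 2 can be illustrated with a minimal sketch. It assumes the corpus is in LOB-style word_TAG format with whitespace-separated tokens; the function names are made up for this illustration and are not part of any CLAWS tool.

```python
# Assumption: LOB-style embedded annotation, one "word_TAG" token per
# whitespace-separated unit. The sample sentence uses C7-style tags.
TAGGED = "He_PPHS1 was_VBDZ going_VVGK to_TO die_VVI ._."


def split_token(token):
    """Split 'word_TAG' at the LAST underscore into (word, tag)."""
    word, _, tag = token.rpartition("_")
    return word, tag


def strip_annotation(tagged_text):
    """Maxim 1: remove the annotation to get back the raw word sequence."""
    return " ".join(split_token(t)[0] for t in tagged_text.split())


def extract_annotation(tagged_text):
    """Maxim 2: extract the annotations by themselves."""
    return [split_token(t)[1] for t in tagged_text.split()]


if __name__ == "__main__":
    print(strip_annotation(TAGGED))
    print(extract_annotation(TAGGED))
```

Note that the recovered text is the token sequence; original spacing around punctuation is only recoverable if the annotation scheme records it.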

Types of corpus annotation
Phonological level
  Syllable boundaries (phonetic/phonemic annotation)
  Prosodic or suprasegmental features (prosodic annotation, e.g. pitch, loudness, intonation)
Morphological level
  Prefixes, suffixes, stems (morphological annotation)
Lexical level
  Tokenisation (essential for Chinese)
  Parts of speech (POS tagging), e.g. present: NN1, VVB, JJ
  Lemmas (lemmatization), e.g. stop, stopped, stops, stopping → stop
  Semantic fields (semantic annotation), e.g. cricket: sport, insect

Tokenisation
The one-to-one correspondence between orthographic and morpho-syntactic word tokens can be taken as the default in English, with three main exceptions:
  Multiword units (e.g. so that and in spite of)
  Mergers (e.g. can’t and gonna)
  Variably spelt compounds (e.g. noticeboard, notice-board, notice board)
CLAWS examples (“ditto tags”)
  so that: so_CS21 that_CS22
  in spite of: in_II31 spite_II32 of_II33
  can’t: ca_VM n’t_XX
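
The ditto tags above can be merged back into multiword units with a short sketch. It assumes that a ditto tag is a base tag followed by two digits (number of words in the unit, then position within it), as in CS21/CS22; the regular expression and the function name are illustrative assumptions, not CLAWS code.

```python
import re

# Assumption: base tag (letters plus at most one digit), then unit size
# (2-9), then position in the unit (1-9). Ordinary tags such as NN1 do
# not match because they end in a single digit.
DITTO = re.compile(r"^(?P<base>[A-Z]+[0-9]?)(?P<size>[2-9])(?P<pos>[1-9])$")


def merge_ditto(tokens):
    """Merge runs of ditto-tagged (word, tag) pairs into one multiword unit."""
    merged = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        match = DITTO.match(tag)
        if match and match.group("pos") == "1":
            size = int(match.group("size"))
            words = [w for w, _ in tokens[i:i + size]]
            merged.append((" ".join(words), match.group("base")))
            i += size
        else:
            merged.append((word, tag))
            i += 1
    return merged


if __name__ == "__main__":
    sample = [("so", "CS21"), ("that", "CS22"),
              ("in", "II31"), ("spite", "II32"), ("of", "II33"),
              ("ca", "VM"), ("n't", "XX")]
    print(merge_ditto(sample))
```

The merger can’t stays as two tokens (ca, n’t) because CLAWS splits it rather than ditto-tagging it.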

BNC-style POS tagging
Explosives found on Hampstead Heath.
<s> <w NN2>Explosives <w VVD>found <w PRP>on <w NP0>Hampstead <w NP0>Heath <PUN> </s>
  <s> new sentence
  NN2 plural noun
  VVD past tense verb
  PRP preposition
  NP0 proper noun
  PUN punctuation
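
A minimal sketch of reading the simplified BNC-style markup shown above into (word, tag) pairs. It handles only the `<w TAG>word` elements as they appear on this slide; the pattern is an assumption for illustration and not a full SGML parser.

```python
import re

SENT = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
        "<w NP0>Hampstead <w NP0>Heath </s>")

# Assumption: each word element looks like "<w TAG>word", with no closing tag.
W_ELEMENT = re.compile(r"<w\s+([A-Z0-9-]+)>\s*([^<\s]+)")


def parse_bnc(line):
    """Return the (word, tag) pairs found in one BNC-style line."""
    return [(word, tag) for tag, word in W_ELEMENT.findall(line)]


if __name__ == "__main__":
    for word, tag in parse_bnc(SENT):
        print(f"{word}\t{tag}")
```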

Example of semantic tagging See http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf for the tagset.

Types of corpus annotation
Syntactic level
  Parsing / treebanking / bracketing
    (S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))
  Stanford Parser: http://nlp.stanford.edu:8080/parser/
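
The bracketed parse above is a labelled tree written out as text. As a rough sketch (the function names are hypothetical, and this is not the Stanford Parser's own API), it can be read into nested (label, children) tuples like this:

```python
import re


def parse_brackets(text):
    """Read a bracketed parse into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a leaf, i.e. a word
        pos += 1
        return word

    return read()


def leaves(tree):
    """Return the words of the tree in order (the raw sentence)."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]


if __name__ == "__main__":
    tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
    print(tree[0])
    print(" ".join(leaves(tree)))
```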

Types of corpus annotation
Discourse level
  Anaphoric relations (coreference annotation)
    (6 the married couple 6) said that <REF=6 they were happy with <REF=6 their lot.
  Speech acts (pragmatic annotation), with 3 layers of coding
    Segmentation (dividing dialogue into textual units, i.e. utterances)
    Functional annotation (dialogue act annotation)
    Utterance tags (applying utterance tags that characterize the role of the utterance as a dialogue act)
  Stylistic features such as speech and thought presentation (stylistic annotation)
    The representation of people’s speech and thoughts, known as speech and thought presentation (S&TP)

Types of corpus annotation
Other types
  Error tagging
    Applies to learner corpus data
    The CLEC error tagging scheme consists of 61 error types clustered in 11 categories
  Problem-specific annotation
    Not exhaustive: covers only the phenomena directly relevant to a particular research question
    Developed for its relevance to the specific research question, not for broad coverage or consensus-based theory-neutrality
    E.g. Hunston (1993) studies how people talk about sameness and difference (“local grammar”)

Annotation styles
Standalone style
  <w id="1">He</w> <w id="2">was</w> <w id="3">going</w> <w id="4">to</w> <w id="5">die</w> <w id="6">.</w> </s>
  <word id="1">PPHS1</word> <word id="2">VBDZ</word> <word id="3">VVGK</word> <word id="4">TO</word> <word id="5">VVI</word> <word id="6">.</word>
Embedded style
  LOB style: going_VVGK
  TEI entity references: going&VVGK;
  WSJ style: going/VVGK
  SGML: <w POS=VVGK>going</w>
  BNC style (simplified SGML): <w VVGK>going
  XML: <w POS="VVGK">going</w>
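
Because the embedded styles carry the same information, converting between them is mechanical. A minimal sketch, assuming the annotation is already held as (word, tag) pairs; the function names are made up for this illustration.

```python
from xml.sax.saxutils import escape, quoteattr

PAIRS = [("He", "PPHS1"), ("was", "VBDZ"), ("going", "VVGK"),
         ("to", "TO"), ("die", "VVI"), (".", ".")]


def to_lob(pairs):
    """LOB style: word_TAG."""
    return " ".join(f"{word}_{tag}" for word, tag in pairs)


def to_wsj(pairs):
    """WSJ style: word/TAG."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def to_xml(pairs):
    """XML style: <w POS="TAG">word</w>, with special characters escaped."""
    return " ".join(f"<w POS={quoteattr(tag)}>{escape(word)}</w>"
                    for word, tag in pairs)


if __name__ == "__main__":
    print(to_lob(PAIRS))
    print(to_wsj(PAIRS))
    print(to_xml(PAIRS))
```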

Introducing CLAWS
CLAWS: some basic facts
  The Constituent Likelihood Automatic Word-tagging System
  Best-known POS tagger for general English
  Has been used to tag a number of large corpora, including the 100M-word BNC
  Has consistently achieved 96-97% accuracy
  A free online tagging service (on an academic website) allows academic users to tag 100,000 words at a time
  http://ucrel.lancs.ac.uk/claws/trial.html

CLAWS tagsets
C7 tagset
  A detailed tagset of 146 tags
  http://ucrel.lancs.ac.uk/claws7tags.html
C5 tagset
  Less refined, 61 tags (BNC tagset)
  http://ucrel.lancs.ac.uk/claws5tags.html
The mapping between C7 and C5 is a many-to-one conversion, and is available in a tab-delimited text file
The C8 tagset is an extension of the C7 tagset that makes further distinctions in the determiner and pronoun categories as well as for auxiliary verbs
  http://ucrel.lancs.ac.uk/claws8tags.pdf
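
A sketch of what a many-to-one conversion looks like in practice. The two-column layout (C7 tag, tab, C5 tag) and the four sample rows are assumptions made for illustration; they are not copied from the UCREL mapping file, which should be used for real work.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only (assumed layout: C7 tag <TAB> C5 tag).
SAMPLE = "VVG\tVVG\nVVGK\tVVG\nNN1\tNN1\nNNT1\tNN1\n"


def load_mapping(handle):
    """Read a tab-delimited C7-to-C5 table into a dict."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def convert(pairs, mapping):
    """Replace each C7 tag with its C5 tag; unknown tags are left unchanged."""
    return [(word, mapping.get(tag, tag)) for word, tag in pairs]


def inverse(mapping):
    """Show the many-to-one nature: each C5 tag with all its C7 sources."""
    sources = defaultdict(set)
    for c7, c5 in mapping.items():
        sources[c5].add(c7)
    return dict(sources)


if __name__ == "__main__":
    mapping = load_mapping(io.StringIO(SAMPLE))
    print(convert([("going", "VVGK")], mapping))
    print(inverse(mapping))
```

Since several C7 tags collapse into one C5 tag, the conversion cannot be reversed without re-tagging.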

Free CLAWS trial service

CLAWS output formats
  Vertical output format
  Horizontal output format (use copy & paste and save as a plain text file)
  Pseudo-XML output format

Windows-based CLAWS D:\ZJU CL\tools\Jclaws\lib\run_jclaws.bat (or antclawsgui) …tagging text in a file

Wmatrix
An online corpus analysis and comparison system
A web interface that gives you access to the CLAWS part-of-speech tagger and the USAS semantic tagger
  USAS: UCREL Semantic Analysis System
Includes standard corpus research tools
  Frequency list, KWIC concordance, wordlist, keyword list, word cluster (n-gram), collocation
Built-in statistical measure: log likelihood for corpus comparison
Integrates POS tagging and semantic field annotation into a single profiling tool
Introduction to Wmatrix: http://ucrel.lancs.ac.uk/wmatrix/
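
The log-likelihood statistic for corpus comparison can be sketched as follows. This is the two-term form commonly given for comparing one item's frequency across two corpora; that Wmatrix computes it in exactly this way is an assumption, and the function name is made up for the illustration.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one item in two corpora.

    freq1, freq2: observed frequency of the item in corpus 1 and corpus 2
    size1, size2: total number of tokens in corpus 1 and corpus 2
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:                  # 0 * ln(0) is taken as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll


if __name__ == "__main__":
    # Hypothetical counts: a word seen 20 times in one 1,000-word text
    # and 5 times in another 1,000-word text.
    print(round(log_likelihood(20, 1000, 5, 1000), 2))
```

Values above 3.84 correspond to p < 0.05 (chi-square distribution, 1 degree of freedom), which is the usual cut-off when reading keyword lists.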

Your Wmatrix account
  You will need a username and password to use Wmatrix
  Write down your username and password
  Tag and download your text as soon as possible if you wish to use Wmatrix to tag your data (POS / semantic) for your project
  …and now log in with your account: http://ucrel.lancs.ac.uk/wmatrix3.html

Click here to run “tag wizard”
Click here to see your work area (for data you have already processed)
Click here to find out more about the UCREL Semantic Analysis System

Amongst other things, the link explains the categorisation scheme utilised … McArthur (1981) Longman Lexicon of Contemporary English Hierarchy of 21 major discourse fields (or domains), which expand into 232 semantic field tags (see the web link) semantic field (or domain) = “A named area of meaning in which lexemes interrelate and define each other in specific ways” (Crystal 1995: 157) Note --- the USAS scheme is derived from McArthur (1981)

The USAS system
Designed to undertake the automatic semantic analysis of present-day English texts (spoken and written)
Involves two stages
  (i) POS tagging by CLAWS
    A POS tag is assigned to every lexical item or multi-word expression (MWE), using probabilistic Markov models of likely part-of-speech sequences (accuracy of 97%+)
  (ii) Output fed into SEMTAG for semantic annotation
    Semantic tags are assigned automatically on the basis of pattern matching between the target text and two computer dictionaries developed for use with the program (accuracy of 92%+)
Present applications: market research, content analysis, information extraction, assistance for translation, linguistic analysis, etc.
MWE = phrasal verbs, compound nouns, multi-word proper nouns, pure idioms.

Let’s do some tagging Once you have logged in: From the Wmatrix home page, click on Tag wizard This will bring up the following page …

Let’s do some tagging
Tag the following two texts:
  Conservative MP Michael Howard’s farewell speech to his party (2005)
    D:\ZJU CL\texts\Howard_speech.txt
  New Labour MP Tony Blair’s farewell speech to his party (2006)
    D:\ZJU CL\texts\texts\Blair_speech.txt
Tip: it is good practice to create one folder for each file

A quick “how to”!
  Enter a new work area name (Blair / Howard)
  Click the browse button to select the right file
  Click the “upload now” button …
  A new screen will provide you with an update report, e.g. part-of-speech tagging, semantic tagging, frequency lists

You will then be taken to your work area [My folders]

What you’ll see in the Simple “VIEW of folder”
  Click on Frequency to see the most frequent words
  You can also do concordance searches of words/phrases
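
For readers who want to see what the Frequency and concordance views compute, here is a rough offline sketch. The tokenizer (lower-cased alphabetic strings) and the function names are simplifying assumptions and do not reproduce how Wmatrix itself tokenizes text.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lower-cased words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text, n=10):
    """The n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(n)


def kwic(text, keyword, width=3):
    """Key Word In Context: (left context, keyword, right context) per hit."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Made-up sentence for the demonstration, not taken from either speech.
    demo = "The party is my party and the party goes on"
    print(frequency_list(demo, 3))
    for left, key, right in kwic(demo, "party", width=2):
        print(f"{left:>12} [{key}] {right}")
```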

Advanced View of Howard Folder
  Click on Frequency to see the most frequent words (as before), and investigate key parts of speech (POS) and key concepts / domains
  How might we discover the most ‘frequent’ POS? Jot them down
  And the most ‘frequent’ semantic fields? Make a note of them
  We can also see all of the keywords using this VIEW

Frequency of words in Howard and Blair (using advanced view) Make a note of the similarities and differences …

Download the tagged text Remember to change filename and file type

Tagging Chinese text
ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System
  Best Chinese tagger
  Fast and reliable (98.45%)
  Online demonstration
  Free download of shareware version: http://ictclas.org/

Online demo

Standalone ICTCLAS
  D:\ZJU CL\tools\ICTCLAS\ICTCLAS_Win.exe
  Tagset: http://www.lancs.ac.uk/fass/projects/corpus/LCMC/lcmc/lcmc_tagset.htm