Automatically Enhancing Tagging Accuracy and Readability for Common Freeware Taggers Martin Weisser Center for Linguistics & Applied Linguistics Guangdong.

Slides:



Advertisements
Similar presentations
A Probabilistic Representation of Systemic Functional Grammar Robert Munro Department of Linguistics, SOAS, University of London.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Sentence Analysis Week 3 – DGP for Pre-AP.
Part-of-speech tagging. Parts of Speech Perhaps starting with Aristotle in the West (384–322 BCE) the idea of having parts of speech lexical categories,
Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.
LING 388 Language and Computers Lecture 22 11/25/03 Sandiway FONG.
Stemming, tagging and chunking Text analysis short of parsing.
POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.
Dictionary.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
Parts of Speech Sudeshna Sarkar 7 Aug 2008.
Some Advances in Transformation-Based Part of Speech Tagging
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Distributional Part-of-Speech Tagging Hinrich Schütze CSLI, Ventura Hall Stanford, CA , USA NLP Applications.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
10/30/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 7 Giuseppe Carenini.
_____________________ Definition Part of Speech (circle one) Picture Antonym (Opposite) Vocab Word Noun Pronoun Adjective Adverb Conjunction Verb Interjection.
Word classes and part of speech tagging Chapter 5.
Linguistics The eleventh week. Chapter 4 Syntax  4.1 Introduction  4.2 Word Classes.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
Speech and Language Processing Ch8. WORD CLASSES AND PART-OF- SPEECH TAGGING.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.
Sentence Analysis Week 2 – DGP for Pre-AP.
Artificial Intelligence: Natural Language
Parts of Speech Major source: Wikipedia. Adjectives An adjective is a word that modifies a noun or a pronoun, usually by describing it or making its meaning.
Parts of Speech Review. A Noun is a person, place, thing, or idea.
GoBack definitions Level 1 Parts of Speech GoBack is a memorization game; the teacher asks students definitions, and when someone misses one, you go back.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Word classes and part of speech tagging Chapter 5.
Parts of Speech By: Miaya Nischelle Sample. NOUN A noun is a person place or thing.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
DGP Week Twelve. Monday DGP Directions: Identify each word as a noun, pronoun, verb, adverb, adjective, preposition, conjunction, interjection, article.
Differences between Spoken and Written Discourse Source: Paltridge, p.p
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
The University of Illinois System in the CoNLL-2013 Shared Task Alla RozovskayaKai-Wei ChangMark SammonsDan Roth Cognitive Computation Group University.
Language Identification and Part-of-Speech Tagging
Introduction to Linguistics
Automatic Writing Evaluation
Lecture 9: Part of Speech
Criterial features If you have examples of language use by learners (differentiated by L1 etc.) at different levels, you can use that to find the criterial.
CSC 594 Topics in AI – Natural Language Processing
Words, Phrases, Clauses, & Sentences
Tools for Natural Language Processing Applications
The Parts of Speech.
LING/C SC/PSYC 438/538 Lecture 20 Sandiway Fong.
Grammar Review.
Probabilistic and Lexicalized Parsing
LING/C SC 581: Advanced Computational Linguistics
Phil Durrant Debra Myhill Mark Brenchley
Topics in Linguistics ENG 331
Parts of the speech and abbreviations
FIRST SEMESTER GRAMMAR
PART OF SPEECH TAGGING (POS)
ICEweb 2 a new way of compiling high-quality web-based components for ICE corpora Martin Weisser Center for Linguistics & Applied Linguistics, Guangdong.
The CoNLL-2014 Shared Task on Grammatical Error Correction
English parts of speech
Writing Conventions Grammar and Composition
Grammar Time! STEP 1: Pick up 5 pieces of paper from the student center – whatever color and color combination you desire.
GEE’S Writing RULES.
Natural Language Processing
Artificial Intelligence 2004 Speech & Natural Language Processing
Presentation transcript:

Automatically Enhancing Tagging Accuracy and Readability for Common Freeware Taggers Martin Weisser Center for Linguistics & Applied Linguistics Guangdong University of Foreign Studies weissermar@hotmail.com http://martinweisser.org

Outline Motivation Typical freeware tagger output Enhancing the readability of the tagset Diversifying tag categories Correcting probabilistic errors The Tagging Optimiser Conclusion

Motivation PoS tagging an essential part of many linguistics projects commercial taggers too costly for smaller-scale, non-funded, projects freeware taggers useful, but with certain drawbacks usu. command-line based possibly difficult to set up/configure (e.g. setting classpath for Stanford Tagger) little control over existing GUIs tagset too limited for fine-grained grammatical analysis tagsets (somewhat) difficult to read/remember errors based on probabilistic approaches

Typical freeware tagger output differences in tag assignment Tagger Stanford Simple PoS Tagger TagAnt (TreeTagger vertical format) Word 1 THE_DT THE_DET THE → DT → the Word 2 CLAIMS_VBZ CLAIMS → NN → CLAIMS Word 3 OF_IN OF → IN → of Word 4 A_DT A_DET A → DT → a Word 5 DOG_NN DOG_NNP DOG → NP → Dog underscore format tab format + lemma slight differences in tag labels

Enhancing the readability of the tagset traditional tags generally consist of 3-4 characters, usu. comprising first character for base category (e.g. N for nouns, V for verbs) additional characters for sub-specification (as applicable) problems need to remember & expand sometimes complex abbreviations using initial as mnemonic for base category not always possible, due to overlap in initials (e.g A for article, adjective & adverb) solutions expand base category names suitably → Adj, Adv, Noun, Verb mark sub-specification via ~, e.g. Adj~comp, Verb~modal generally facilitate expansion through longer abbreviations

Diversifying tag categories (1) – Issues original Penn Treebank categories used by freeware taggers geared towards parsing hence minimalistic (48 tags only) few sub-distinctions conflation of categories, e.g. IN for prepositions & subordinating conjunctions 13 ‘reserved’ for punctuation or special characters less suitable for more fine-grained grammatical analyses many useful distinctions, such as between coordinating and subordinating conjunctions, or subject and object pronouns, not present suitability for tagging spoken language limited, e.g. interjections, discourse markers & response signals conflated under UH tag

Diversifying tag categories (2) – Solutions distinguish between main categories conjunctions and prepositions to as Inf marker and preposition adjectives and ordinal numbers distinguish between different sub-categories determiners: definite, indefinite, demonstrative verbs: BE-, DO- & HAVE-forms treated/marked differently from general verbs pronouns: include case roles as much as possible mark discourse-relevant features more clearly, identifying discourse markers (DM) minimal yes-/no-responses (Resp~yes & Resp~no) interjections (Interj)

Correcting probabilistic errors (1) – reasons according to Manning (2011) 97% top accuracy generally reported = “token accuracy” in contrast, “sentence accuracy […] around 55–57%” reasons to be found at different levels in tagger design/annotated training materials gaps in the lexicon unknown words errors where the tagger ought to plausibly be able to identify the right tag errors that may require more linguistic context underspecified or unclear tag examples incon­sistent or no standard wrong gold standard

Correcting probabilistic errors (2) – examples of corrections our_PRPS hero_NN committed_VBN a_DET cardinal_NN sin_NN committed_Verb~past They_PRP disgust_NN and_CC please_VB us_PRP disgust_Verb~not3s the_DT pleasant_JJ trickle_VB of_IN Glenlivet_NNP Double_JJ Malt_NN trickle_Noun~Sing manhood_NN blighted_JJ with_IN unripe_NN di­sease_NN blighted_Verb~ppart and_CC that_IN glorious_JJ sound_NN which_WDT only_RB our_PRP$ British_JJ amateur_JJ choruses_NNS produce_VBP that_Det~demSing

The Tagging Optimiser extension selector editor pane I/O workspaces

Conclusion first attempt at improving output, but room for improvement  by making use of the best probabilistic freeware taggers have to offer diversifying tag categories further by establishing more contextual rules increasing tag precision by taking into account more contextual clues, e.g. for underspecified forms need for community feedback on performance and suitability of tagset, perhaps working towards a new community standard version 1.0 will be downloadable from http:/martinweisser.org/ling_soft.html#tagOpt by early October 2018

References Anthony, L. (2015). TagAnt (Version 1.2.0) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/. Fligelstone, S., Rayson, P., and Smith, N. (1996). Template analysis: bridging the gap between grammar and the lexicon. In J. Thomas & M. Short (Eds.). Using corpora for language research. London: Longman. 181–207. Garside, R. and Smith, N. (1997). A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech, & T. McEnery (Eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. 102–121. Manning, C. (2011). Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? In A. Gelbukh (Ed.). Proceedings of the 12th international conference on Computational linguistics and intelligent text processing (CICLing’11). 171–189. Ó Duibhín, C. (2017). Windows Interface for Tree Tagger. [Computer Software]. Available from http://www.smo.uhi.ac.uk/~oduibhin/oideasra/interfaces/winttinterface.htm. Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.