Topics in Linguistics ENG 331

Topics in Linguistics ENG 331
Rania Al-Sabbagh
Department of English, Faculty of Al-Alsun (Languages), Ain Shams University
rsabbagh@alsun.asu.edu.eg
Week 10

Raw Corpora – Limited Benefits Raw corpora are of limited use. The best we can do with a raw corpus is to get a wordlist (what do we use wordlists for?) or run some complex searches using RegEx (give examples!). However, to make the most of a corpus, we may need to annotate it; that is, to add linguistic information to the texts. Linguistic information can be of any type: phonetic, phonological, morphological, syntactic, etc.
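To make this concrete, here is a minimal sketch of both uses of a raw corpus, a frequency wordlist and a RegEx search; the file name corpus.txt is just a placeholder for any plain-text corpus you have.

```python
# A minimal sketch: build a frequency wordlist and run a RegEx search
# over a raw (unannotated) corpus. "corpus.txt" is a placeholder name.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Wordlist: count every alphabetic token in the corpus.
wordlist = Counter(re.findall(r"[a-z]+", text))
print(wordlist.most_common(10))

# RegEx search: e.g. all words ending in "-ness".
print(sorted(set(re.findall(r"\b\w+ness\b", text))))
```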

Automatic Corpus Annotation Corpus annotation is a very tedious process. Imagine how much time you would need to manually label each word for its grammatical class in a corpus of 1M tokens. It is always good to look for software that does the analysis automatically. This is referred to as automatic corpus annotation. Some of the software used for automatic corpus annotation includes:
Part-of-Speech (POS) taggers
Morphological analyzers
Syntactic parsers
Semantic role labelers

Part of Speech Taggers 1 Part-of-Speech (POS) taggers are computer software that automatically label each word in context for its grammatical class. In context means that the word must be within a phrase, a clause, or a sentence. There are many POS taggers on the market, especially for English. Two of the widely used ones are:
CLAWS
TreeTagger

Part of Speech Taggers 2 Input: I have two dogs.
CLAWS Output: I_PNP have_VHB two_CRD dogs_NN2 ._PUN
TreeTagger Output:
I	PP	I
have	VHP	have
two	CD	two
dogs	NNS	dog
.	SENT	.
The differences you see are not only in the formatting (i.e. inline vs. stand-off), but also in the content (i.e. the tagset).
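For comparison, here is a minimal sketch of tagging the same sentence with NLTK's default tagger (not CLAWS or TreeTagger): it uses the Penn Treebank tagset, yet another tagset, which illustrates how the labels differ from tool to tool. It assumes NLTK and its standard English models are installed.

```python
# A minimal sketch of automatic POS tagging with NLTK's default tagger.
import nltk
nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # tagger model

tokens = nltk.word_tokenize("I have two dogs.")
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tags):
# [('I', 'PRP'), ('have', 'VBP'), ('two', 'CD'), ('dogs', 'NNS'), ('.', '.')]
```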

Part of Speech Taggers 3 If there are many POS taggers out there, how can we select one? First, performance: test the tagger against a manually annotated random sample. Second, the tagset and whether it gives me the information I need. The tagset is the set of labels used to mark the grammatical class of each word. Some POS taggers have a very concise tagset. For example, the CATiB POS tagger for Arabic has only 6 labels: NOM, PROP, VRB, VRB-PASS, PRT, PNX. If you are interested in studying prepositions, this tagset may not be good for you.
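Measuring performance against a manually annotated sample comes down to counting matches. A minimal sketch, using made-up toy tag sequences in place of your hand-checked sample and the tagger's output:

```python
# A minimal sketch of evaluating a tagger against a gold-standard sample.
# The two sequences below are invented toy data for illustration.
gold = ["PRP", "VBP", "CD", "NNS", "."]   # manual (gold) tags
pred = ["PRP", "VBP", "CD", "NN",  "."]   # tagger output

correct = sum(g == p for g, p in zip(gold, pred))
accuracy = correct / len(gold)
print(f"Tagging accuracy: {accuracy:.0%}")  # -> 80%
```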

Morphological Analyzers Morphological analyzers – sometimes referred to as stemmers – split off affixes. One well-known example for English is the Porter Stemmer. Morphological analysis is especially important for morphologically rich languages such as Arabic. Can you guess why? A well-known morphological analyzer for Arabic is MADAMIRA.
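A minimal sketch of the Porter Stemmer via NLTK: it strips affixes by rule, so the output is a stem, not always a dictionary word.

```python
# Porter Stemmer example: rule-based affix stripping.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dogs", "running", "studies", "happiness"]:
    print(word, "->", stemmer.stem(word))
# dogs -> dog, running -> run, studies -> studi, happiness -> happi
```

Note that "studi" and "happi" are stems rather than words, which is exactly the stemmer's job: grouping related forms, not producing lemmas.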

Syntactic Parsers Syntactic parsers are computer software that label phrases for their syntactic types. Examples of well-known syntactic parsers for English are:
the Link Grammar parser
the Stanford Parser
ARK
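To see what "labeling phrases for their syntactic types" means, here is a minimal sketch using NLTK's chart parser with a tiny toy grammar (the parsers named above use far richer grammars or statistical models; this grammar is invented purely for illustration):

```python
# A toy context-free grammar and chart parser: the parse tree labels
# each phrase (NP, VP) for its syntactic type.
import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> PRP | CD NNS
  VP  -> V NP
  PRP -> 'I'
  V   -> 'have'
  CD  -> 'two'
  NNS -> 'dogs'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["I", "have", "two", "dogs"]):
    print(tree)
# (S (NP (PRP I)) (VP (V have) (NP (CD two) (NNS dogs))))
```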

Semantic Role Labeling Semantic role labeling analyzes the role that each word or phrase plays in the sentence, e.g. as an agent, a patient, or an adjunct. One example of a semantic role labeler for English is: http://cogcomp.org/page/demo_view/srl

Manual Corpus Annotation Unfortunately, not all levels of linguistic analysis can be done automatically. For example, to the best of my knowledge, there is no software that can extract metaphors or identify the type of speech act. Consequently, we have to go for the manual annotation option, where human beings are hired to read the corpus word for word and do the analysis we want them to do. Although manual annotation is very expensive and time-consuming, it is typically more accurate than automatic annotation.

Manual Annotation Modes There are two modes for manual annotation:
In-lab: hire expert human annotators, train them well, and have them work in a controlled environment.
Crowdsourcing: post a survey online and have people from around the world do the analysis for you. Two of the most famous platforms that host crowdsourced annotation are Amazon Mechanical Turk and CrowdFlower.
Do you think crowdsourcing works for all levels of language analysis? What problems can we have with this mode of annotation? Can we avoid them?

In-Lab vs. Crowdsourcing The pros and cons of each mode are as follows:
Crowdsourcing: cheap; non-expert annotators; typically for simple tasks; many scammers.
In-lab: expensive; expert annotators; works for all tasks; controlled environment.

Best Practices For in-lab annotation, the researcher needs to prepare well-written annotation guidelines; all ad hoc decisions need to be written into the guidelines, and the annotators should be trained on a sample corpus before working individually. For crowdsourcing, the researcher can have an admission test at the beginning of the online survey, or repeat questions randomly throughout the survey to catch scammers. How to handle disagreement? For in-lab annotation, if there are 3+ annotators, you can take a majority vote; otherwise, the researcher can settle disagreements him/herself or exclude the disagreed-upon instances. For crowdsourced annotation, get as many survey takers as possible so that you can take majority votes, as in the sketch below.
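A minimal sketch of settling disagreement by majority vote when there are 3+ annotators; the labels below are toy data for a single corpus item.

```python
# Majority voting over annotator labels for one item.
from collections import Counter

def majority_vote(labels):
    """Return the majority label, or None on a tie (send to adjudication)."""
    (top, n1), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n1:
        return None  # tie: the researcher settles it or excludes the item
    return top

print(majority_vote(["NOUN", "NOUN", "VERB"]))  # NOUN
print(majority_vote(["NOUN", "VERB"]))          # None (unresolved)
```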

Quiz True or False?
1. Stemmers identify affixes.
2. All POS taggers use the same tagset.
3. Manual annotation uses computer software.
4. Part of speech taggers work at the word level.
5. Automatic annotation relies on human efforts.
6. Syntactic parsers analyze phrases and sentences.
7. Crowdsourcing is cheaper than in-lab annotation.
8. With crowdsourcing, you can select expert annotators.
9. Crowdsourcing might be less accurate than in-lab annotation.
10. Corpus annotation is adding linguistic information to the corpus texts.