Corpus Linguistics I ENG 617

Presentation transcript:

Corpus Linguistics I ENG 617 Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Ain Shams University rsabbagh@alsun.asu.edu.eg Week 12

Raw Corpora – Limited Benefits Raw corpora are of limited use. The best we can do with a raw corpus is to generate a wordlist (what do we use wordlists for?) or to run some more complex searches using RegEx (give examples!). However, to make the most of a corpus, we may need to annotate it; that is, to add linguistic information to the texts. Linguistic information can be of any type: phonetic, phonological, morphological, syntactic, etc. Week 12
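To make this concrete, here is a minimal sketch in Python of the two things a raw, unannotated corpus already supports: building a frequency wordlist and running a RegEx search. The file name corpus.txt is a hypothetical stand-in for any plain-text corpus file.

```python
import re
from collections import Counter

# Read a raw (unannotated) corpus file; "corpus.txt" is a hypothetical example.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Wordlist: a crude tokenization followed by a frequency count.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
wordlist = Counter(tokens)
print(wordlist.most_common(10))          # the 10 most frequent word forms

# A RegEx search: every word ending in -ness, with a little left context.
for m in re.finditer(r"\b\w+ness\b", text):
    print(text[max(0, m.start() - 30):m.end()])
```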

Automatic Corpus Annotation Corpus annotation is a very tedious process. Imagine how much time you would need to manually label each word for its grammatical class in a corpus of 1M tokens. It is always good to look for software that can do the analysis automatically. This is referred to as automatic corpus annotation. Some types of software used for automatic corpus annotation are: Part-of-Speech (POS) taggers, morphological analyzers, syntactic parsers, and semantic role labelers. Week 12

Part of Speech Taggers 1 Part-of-Speech (POS) taggers are computer software that automatically label each word in context for its grammatical class. In context means that the word must appear within a phrase, a clause, or a sentence. There are many POS taggers on the market, especially for English. Two of the widely used ones are: CLAWS and TreeTagger. Week 12

Part of Speech Taggers 2 Input: I have two dogs.
CLAWS Output: I_PNP have_VHB two_CRD dogs_NN2 ._PUN
TreeTagger Output:
I     PP    I
have  VHP   have
two   CD    two
dogs  NNS   dog
.     SENT  .
The differences you see are not only in the formatting (i.e. inline vs. stand-off), but also in the content (i.e. the tagset). Week 12
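For comparison, here is a third option, NLTK's default tagger, which uses yet another tagset (the Penn Treebank one). This is only an illustration of how output formats and tagsets differ across taggers; it assumes the standard NLTK tokenizer and tagger resources are already installed.

```python
import nltk

# Tokenize and tag the same example sentence with NLTK's default tagger.
tokens = nltk.word_tokenize("I have two dogs.")
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tags):
# [('I', 'PRP'), ('have', 'VBP'), ('two', 'CD'), ('dogs', 'NNS'), ('.', '.')]
```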

Part of Speech Taggers 3 If there are many POS taggers out there, how can we select one? Performance against a manually-annotated random sample. The tagset, and whether it gives you the information you need. The tagset is the set of labels used to mark the grammatical class of each word. Some POS taggers have a very concise tagset. For example, the CATiB POS tagger for Arabic has only 6 labels: NOM, PROP, VRB, VRB-PASS, PRT, PNX. If you are interested in studying prepositions, this tagset may not be good for you. Week 12

Morphological Analyzers Morphological analyzers – sometimes referred to as stemmers – split off affixes. One well-known example for English is the Porter Stemmer. Morphological analysis is especially important for morphologically rich languages such as Arabic. Can you guess why? A well-known morphological analyzer for Arabic is MADAMIRA. Week 12
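As a quick illustration, the Porter Stemmer can be tried out through NLTK's implementation of it; the example words below are just for demonstration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Split off affixes from a few example words.
for word in ["caresses", "ponies", "running", "happiness"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, running -> run, happiness -> happi
```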

Syntactic Parsers Syntactic parsers are computer software that label phrases for their syntactic types. Examples of well-known syntactic parsers for English are: Link Grammar, the Stanford parser, and ARK. Week 12
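To show what labelling phrases for their syntactic types means in practice, here is a minimal sketch using a toy context-free grammar with NLTK's chart parser. This is not one of the parsers named above, and the grammar covers only the example sentence.

```python
import nltk

# A toy grammar for the example sentence only; real parsers are trained on
# or derived from much broader grammars and treebanks.
grammar = nltk.CFG.fromstring("""
S -> NP VP PUNCT
NP -> PRP | CD NNS
VP -> VBP NP
PRP -> 'I'
VBP -> 'have'
CD -> 'two'
NNS -> 'dogs'
PUNCT -> '.'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["I", "have", "two", "dogs", "."]):
    print(tree)
# prints something like:
# (S (NP (PRP I)) (VP (VBP have) (NP (CD two) (NNS dogs))) (PUNCT .))
```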

Semantic Role Labeling Semantic role labeling analyzes the role of each word in the sentence, e.g. as a subject, an object, or an adjunct. One example of a semantic role labeler for English is: http://cogcomp.org/page/demo_view/srl Week 12

Manual Corpus Annotation Unfortunately, not all levels of linguistic analysis can be done automatically. For example, to the best of my knowledge, there is no software that can extract metaphors or identify the type of a speech act. Consequently, we have to go for the manual annotation option, where human annotators are hired to read the corpus word for word and do the analysis we want them to do. Although manual annotation is very expensive and time consuming, it is typically more accurate than automatic annotation. Week 12

Manual Annotation Modes There are two modes of manual annotation: In-lab: hire expert human annotators, train them well, and have them work in a controlled environment. Crowdsourcing: post a survey online and have people from around the world do the analysis for you. Two of the most famous platforms that host crowdsourcing annotation are Amazon Mechanical Turk and CrowdFlower. Do you think crowdsourcing works for all levels of language analysis? What problems can we have with this mode of annotation? Can we avoid them? Week 12

In-Lab vs. Crowdsourcing The pros and cons of each mode are as follows:
Crowdsourcing: cheap; non-expert annotators; typically for simple tasks; many scammers.
In-lab: expensive; expert annotators; works for all tasks; controlled environment.
Week 12

Best Practices For in-lab annotation, the researcher needs to prepare well-written annotation guidelines. All ad hoc decisions need to be written into the guidelines. The annotators should be trained on a sample corpus before working individually. For crowdsourcing, the researcher can include an admission test at the beginning of the online survey, or repeat questions randomly throughout the survey. How do we handle disagreement? For in-lab annotation, if there are 3+ annotators you can take a majority vote (see the sketch below); otherwise, the researcher can settle the disagreement him/herself or exclude the disagreed-upon instances. For crowdsourcing annotation, get as many survey takers as possible so that majority votes are possible. Week 12
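As mentioned above, here is a minimal sketch of settling disagreement by majority vote when there are three or more annotators; ties are left for the researcher to settle or exclude.

```python
from collections import Counter

def majority_label(labels):
    """Return the majority label, or None if there is a tie for first place."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None          # tie: settle it yourself or exclude the instance
    return counts[0][0]

print(majority_label(["positive", "positive", "negative"]))  # positive
print(majority_label(["positive", "negative"]))              # None (tie)
```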

Annotation Software 1 For in-lab annotation, it is sometimes useful to use annotation software. It helps annotators maintain consistency, and some tools give you descriptive and contrastive statistics about the corpus. One easy-to-use tool is the UAM CorpusTool. Suppose that we want to annotate a corpus of poems for figurative language. How can we use UAM for that? For illustration purposes, we will use the following poem:
Your feet smell so bad
Just like limburger cheese
That I'm holding my nose tight
Between my two knees.
Week 12

UAM Corpus Tool 1 After installing UAM on your local machine, you will need to run it and click “start new project” Week 12

UAM Corpus Tool 2 After naming your project and selecting a folder to save all the annotation files in, you will need to upload your corpus files. You click “extend corpus” first, and as the files appear in the lower panel, you click “incorporate all”. Week 12

UAM Corpus Tool 3 After uploading your corpus files, you can start building your annotation scheme from Layers > Add Layer > Start After naming your scheme, there will be a list of choices that you need to make: Automatic vs. manual annotation: UAM has a number of built-in POS taggers and parsers, but they only work for English, so in our case we will choose manual annotation. Since we don't have a predefined scheme, we will design our own. Since figurative language is realized at the word, phrase, clause, or sentence level, we will choose “segments within a document”. Week 12

UAM Corpus Tool 4 There is no one special document layer that we are working on, so we will choose “no” for the special layer question and “no” again for the automatic segmentation. Now, our scheme is being created but we still need to edit it a little. Week 12

UAM Corpus Tool 5 To edit the scheme, click “edit scheme” Right click the features (in the red circle) to rename them. Right click the name of the scheme (in the blue circle) to add features Week 12

UAM Corpus Tool 6 Now that your corpus files are uploaded and your annotation scheme is ready, click “figurative” to start annotating. Then, from the files, click the name of your annotation scheme to start. Week 12

UAM Corpus Tool 7 The annotation screen is very handy. You need to mark the segment you want and click on the tag that best describes it. You proceed until the end of the text and, of course, do not forget to save your work. You can also leave comments next to each segment that you annotate. Week 12

UAM Corpus Tool 8 Once the annotation is done, it is time to get statistics, so click on “Statistics”. To store your annotation as an XML file, from the annotation screen click Coding > Export Annotation. Week 12

Inter-Annotator Agreement 1 How do we measure inter-annotator agreement? Percentage agreement is the easiest and most straightforward measure. However, it does not take chance agreement into consideration. A more sophisticated measure is Cohen's Kappa Coefficient, which does take chance agreement into consideration. How does it work? It works when you have two annotators. It works when you have mutually exclusive categorical annotation. Week 12

Inter-Annotator Agreement 2 Suppose we have a corpus of 50 tweets and two annotators. Each annotator is to label each tweet as either positive or negative. Their labels cross-tabulate as follows:
                       Ann. 1
                       Positive   Negative
Ann. 2   Positive         20          5
         Negative         10         15
First, we calculate the observed probability of agreement:
Po = (20 + 15) / 50 = 0.7
Second, we calculate the expected probability that both annotators would say positive at random:
P(positive) = ((20 + 5) / 50) × ((20 + 10) / 50) = 0.5 × 0.6 = 0.3
Week 12

Inter-Annotator Agreement 3 Suppose we have a corpus of 50 tweets and two annotators. Each annotator is to label each tweet as either positive or negative.
                       Ann. 1
                       Positive   Negative
Ann. 2   Positive         20          5
         Negative         10         15
Third, we calculate the expected probability that both annotators would say negative at random:
P(negative) = ((10 + 15) / 50) × ((5 + 15) / 50) = 0.5 × 0.4 = 0.2
Fourth, we calculate the expected probability of chance agreement over both categories, positive and negative:
Pe = P(positive) + P(negative) = 0.3 + 0.2 = 0.5
Week 12

Inter-Annotator Agreement 4 Suppose we have a corpus of 50 tweets and two annotators. Each annotator is to label each tweet as either positive or negative.
                       Ann. 1
                       Positive   Negative
Ann. 2   Positive         20          5
         Negative         10         15
Finally, we calculate the Kappa Coefficient:
K = (Po − Pe) / (1 − Pe) = (0.7 − 0.5) / (1 − 0.5) = 0.4 (moderate agreement)
We can see the big difference between percentage agreement (0.7) and the Kappa coefficient (0.4). Week 12
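The same calculation can be scripted. The sketch below reproduces the worked example from a 2×2 contingency table (rows = Ann. 2, columns = Ann. 1).

```python
def cohens_kappa(table):
    """Cohen's Kappa for two annotators; table[i][j] counts items labelled
    category i by Ann. 2 and category j by Ann. 1."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n               # observed agreement
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# The worked example above: 20/5 in the first row, 10/15 in the second.
print(round(cohens_kappa([[20, 5], [10, 15]]), 2))   # 0.4 (moderate agreement)
```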

Inter-Annotator Agreement 5 How do we interpret the Kappa Coefficient? Cohen's Kappa Coefficient is one of many inter-annotator agreement measures, each of which is suitable for a specific situation.
K             Interpretation
< 0           Poor agreement
0.01 – 0.20   Slight agreement
0.21 – 0.40   Fair agreement
0.41 – 0.60   Moderate agreement
0.61 – 0.80   Substantial agreement
0.81 – 1.00   Almost perfect agreement
Week 12

Quiz True or False?
Stemmers identify affixes.
All POS taggers use the same tagset.
Manual annotation uses computer software.
Part-of-speech taggers work at the word level.
Automatic annotation relies on human efforts.
Syntactic parsers analyze phrases and sentences.
Crowdsourcing is cheaper than in-lab annotation.
With crowdsourcing, you can select expert annotators.
Scammers are one main disadvantage of in-lab annotation.
Crowdsourcing might be less accurate than in-lab annotation.
Percent agreement takes into consideration chance agreement.
Corpus annotation is adding linguistic information to the corpus texts.
Week 12

Quiz True or False?
It is a must to use a corpus annotation tool.
Majority vote is one way to settle annotation disagreement.
A high inter-annotator agreement rate indicates reliable annotation.
Annotation guidelines are designed by the researcher, not the annotators.
Week 12