
1 Corpus Linguistics I ENG 617
Rania Al-Sabbagh, Department of English, Faculty of Al-Alsun (Languages), Ain Shams University. Week 12

2 Raw Corpora – Limited Benefits
Raw corpora are of limited use. The best we can do with a raw corpus is to generate a wordlist (what do we use wordlists for?) or run some complex searches using RegEx. Give examples! However, to make the most of a corpus, we may need to annotate it; that is, to add linguistic information to the texts. Linguistic information can be of any type: phonetic, phonological, morphological, syntactic, etc.
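As a concrete illustration (not from the slides), here is a minimal Python sketch of both tasks on a raw corpus: building a frequency wordlist and running a RegEx search. The file name corpus.txt is a hypothetical stand-in for your own corpus file.

    import re
    from collections import Counter

    # A hypothetical plain-text corpus file standing in for your own data.
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # Wordlist: a crude tokenizer (letters, plus simple contractions) and a frequency count.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    wordlist = Counter(tokens)
    print(wordlist.most_common(10))                    # the ten most frequent word forms

    # A more complex RegEx search over the raw text: every word ending in -ness.
    print(sorted(set(re.findall(r"\b\w+ness\b", text))))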

3 Automatic Corpus Annotation
Corpus annotation is a very tedious process. Imagine how much time you would need to manually label each word for its grammatical class in a corpus of 1M tokens. It is always good to look for software that can do the analysis automatically. This is referred to as automatic corpus annotation. Some types of software used for automatic corpus annotation are: Part-of-Speech (POS) taggers, morphological analyzers, syntactic parsers, and semantic role labelers.

4 Part of Speech Taggers 1 Part-of-Speech (POS) taggers are computer software that automatically labels each word in context for its grammatical class. In context means that the word must be within a phrase, a clause, or a sentence. There are many POS taggers on the market, especially for English. Two of the widely used ones are CLAWS and TreeTagger.

5 Part of Speech Taggers 2 Input: I have two dogs.
CLAWS Output: I_PNP have_VHB two_CRD dogs_NN2 ._PUN
TreeTagger Output:
I       PP      I
have    VHP     have
two     CD      two
dogs    NNS     dog
.       SENT    .
The differences you see are not only in the formatting (i.e. inline vs. stand-off), but also in the content (i.e. the tagset).
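For readers who want to try POS tagging themselves, here is a hedged sketch using NLTK's default tagger rather than CLAWS or TreeTagger; it outputs Penn Treebank tags, i.e. yet another tagset.

    import nltk

    # One-time resource downloads (on newer NLTK releases the resources are named
    # "punkt_tab" and "averaged_perceptron_tagger_eng" instead).
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("I have two dogs.")
    print(nltk.pos_tag(tokens))
    # [('I', 'PRP'), ('have', 'VBP'), ('two', 'CD'), ('dogs', 'NNS'), ('.', '.')]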

6 Part of Speech Taggers 3 If there are many POS taggers out there, how can we select one? Two criteria matter: performance against a manually-annotated random sample, and the tagset and whether it gives me the information I need. The tagset is the set of labels used to mark the grammatical class of each word. Some POS taggers have a very concise tagset. For example, the CATiB POS tagger for Arabic has only 6 labels: NOM, PROP, VRB, VRB-PASS, PRT, PNX. If you are interested in studying prepositions, this tagset may not be good for you.

7 Morphological Analyzers
Morphological analyzers – sometimes referred to as stemmers – split off affixes. One well-known example for English is the Porter Stemmer. Morphological analysis is especially important for morphologically rich languages such as Arabic. Can you guess why? A well-known morphological analyzer for Arabic is MADAMIRA.
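One common way to try the Porter Stemmer is through NLTK's implementation; the sketch below assumes NLTK is installed. Note how crude the stems are compared with the output of a full morphological analyzer such as MADAMIRA.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["dogs", "running", "happiness", "studies"]:
        print(word, "->", stemmer.stem(word))
    # dogs -> dog, running -> run, happiness -> happi, studies -> studi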

8 Syntactic Parsers Syntactic parsers are computer software that labels phrases for their syntactic types. Examples of well-known syntactic parsers for English are the Link Grammar parser, the Stanford Parser, and the ARK parser.
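To show what a parser's phrase labels look like, here is a toy sketch using NLTK's chart parser with a tiny hand-written context-free grammar for the example sentence from the POS-tagging slides; real parsers such as the Stanford Parser rely on much larger grammars or statistical models.

    import nltk

    # A toy context-free grammar, just enough to cover "I have two dogs".
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> PRP | CD NNS
    VP -> VBP NP
    PRP -> 'I'
    VBP -> 'have'
    CD -> 'two'
    NNS -> 'dogs'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["I", "have", "two", "dogs"]):
        print(tree)
    # (S (NP (PRP I)) (VP (VBP have) (NP (CD two) (NNS dogs))))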

9 Semantic Role Labeling
Semantic role labeling analyzes the role of each word in the sentence, e.g. as a subject, an object, or an adjunct. One example of a semantic role labeler for English is:

10 Manual Corpus Annotation
Unfortunately, not all levels of linguistic analysis can be done automatically. For example, to the best of my knowledge, there is no software that can extract metaphors or identify the type of speech act. Consequently, we have to go for the manual annotation option, where human beings are hired to read the corpus word for word and do the analysis we want them to do. Although manual annotation is very expensive and time-consuming, it is typically more accurate than automatic annotation.

11 Manual Annotation Modes
There are two modes for manual annotation: In-lab: hire expert human annotators, train them well, and have them work in a controlled environment. Crowdsourcing: post a survey online and have people from around the world do the analysis for you. Two of the most famous platforms that host crowdsourcing annotation are Amazon Mechanical Turk and Crowdflower. Do you think crowdsourcing works for all levels of language analysis? What problems can we have with this mode of annotation? Can we avoid them?

12 In-Lab vs. Crowdsourcing
The pros and cons of each mode are as follows:
Crowdsourcing: cheap; non-expert annotators; typically for simple tasks; many scammers.
In-lab: expensive; expert annotators; works for all tasks; controlled environment.

13 Best Practices For in-lab annotation, the researcher needs to prepare well-written annotation guidelines. All ad hoc decisions need to be written into the guidelines. The annotators should be trained on a sample corpus before working individually. For crowdsourcing, the researcher can include an admission test at the beginning of the online survey, or repeat questions randomly throughout the survey. How to handle disagreement? For in-lab annotation, if there are 3+ annotators you can take a majority vote; otherwise, the researcher can settle the disagreement him/herself or exclude the instances of disagreement. For crowdsourcing annotation, get as many survey takers as possible so that majority votes are possible (see the sketch below).
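The majority-vote idea is simple enough to sketch in a few lines of plain Python; the label names below are made up for illustration.

    from collections import Counter

    def majority_label(labels):
        """Return the most frequent label, or None if the top labels are tied."""
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            return None                        # unresolved: adjudicate or exclude the item
        return counts[0][0]

    # Three hypothetical annotators labelling two items.
    print(majority_label(["positive", "positive", "negative"]))   # positive
    print(majority_label(["positive", "negative", "neutral"]))    # None -> needs adjudication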

14 Annotation Software 1 For in-lab annotation, it is sometimes useful to use annotation software. It helps annotators maintain consistency, and some tools give you descriptive and contrastive statistics about the corpus. One easy-to-use tool is UAM. Suppose that we want to annotate a corpus of poems for figurative language. How can we use UAM for that? For illustration purposes, we will use the following poem:
Your feet smell so bad
Just like limburger cheese
That I'm holding my nose tight
Between my two knees.

15 UAM Corpus Tool 1 After installing UAM on your local machine, you will need to run it and click “start new project”.

16 UAM Corpus Tool 2 After naming your project and selecting a folder to save all the annotation files in, you will need to upload your corpus files. You click “extend corpus” first, and as the files appear in the lower panel, you click “incorporate all”.

17 UAM Corpus Tool 3 After uploading your corpus files, you can start building your annotation scheme from Layers > Add Layer > Start. After naming your scheme, there is a list of choices that you need to make: Automatic vs. manual annotation: UAM has a number of built-in POS taggers and parsers, but they only work for the English language, so in our case we will choose manual annotation. Since we don’t have a predefined scheme, we will design our own. Since figurative language is realized at the word, phrase, clause, or sentence level, we will choose “segments within a document”.

18 UAM Corpus Tool 4 There is no special document layer that we are working on, so we will choose “no” for the special layer question and “no” again for automatic segmentation. Now our scheme has been created, but we still need to edit it a little.

19 UAM Corpus Tool 5 To edit the scheme, click “edit scheme”.
Right-click the features (in the red circle) to rename them. Right-click the name of the scheme (in the blue circle) to add features.

20 UAM Corpus Tool 6 Now that your corpus files are uploaded and your annotation scheme is ready, click “figurative” to start annotating. From the files list, click the name of your annotation scheme to start.

21 UAM Corpus Tool 7 The annotation screen is very handy. You need to mark the segment you want and click on the tag that best describes it. You proceed until the end of the text and, of course, do not forget to save your work. You can also leave comments next to each segment that you annotate.

22 UAM Corpus Tool 8 Once the annotation is done, it is time to get statistics, so click on ‘statistics’. To store your annotation as an XML file, click Coding > Export Annotation from the annotation screen.

23 Inter-Annotator Agreement 1
How do we measure inter-annotator agreement? Percentage agreement is the easiest and most straightforward measure. However, it does not take chance agreement into consideration. A more sophisticated measure is Cohen’s Kappa Coefficient, which does take chance agreement into consideration. When does it apply? It works when you have two annotators and mutually exclusive categorical annotation labels.

24 Inter-Annotator Agreement 2
Suppose we have a corpus of 50 tweets and two annotators, and each annotator labels each tweet as either positive or negative. The labels cross-tabulate as follows (rows = Ann. 2, columns = Ann. 1):
                   Ann. 1 Positive   Ann. 1 Negative
Ann. 2 Positive          20                 5
Ann. 2 Negative          10                15
First, we calculate the observed agreement: Po = (20 + 15) / 50 = 0.7
Second, we calculate the expected probability that both annotators would say positive at random: P(positive) = (25/50) * (30/50) = 0.5 * 0.6 = 0.3

25 Inter-Annotator Agreement 3
(The same 50-tweet corpus and contingency table as on the previous slide.)
Third, we calculate the expected probability that both annotators would say negative at random: P(negative) = (25/50) * (20/50) = 0.5 * 0.4 = 0.2
Fourth, we calculate the overall expected chance agreement across both categories – positive and negative: Pe = P(positive) + P(negative) = 0.3 + 0.2 = 0.5

27 Inter-Annotator Agreement 5
(The same 50-tweet corpus and contingency table as before.)
Finally, we calculate the Kappa Coefficient: K = (Po - Pe) / (1 - Pe) = (0.7 - 0.5) / (1 - 0.5) = 0.4, which the scale on the next slide classifies as fair agreement. We can see the big difference between percentage agreement (0.7) and the Kappa coefficient (0.4).
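The whole calculation can be verified in a few lines of plain Python. The sketch below simply reproduces the slides' arithmetic from the 2x2 contingency table; libraries such as scikit-learn also provide a ready-made cohen_kappa_score function that works on per-item label lists.

    # Contingency table from the slides: rows = Ann. 2, columns = Ann. 1.
    #                    Ann. 1 positive   Ann. 1 negative
    # Ann. 2 positive          20                 5
    # Ann. 2 negative          10                15
    table = [[20, 5],
             [10, 15]]
    n = sum(sum(row) for row in table)              # 50 tweets in total

    p_o = (table[0][0] + table[1][1]) / n           # observed agreement: (20 + 15) / 50 = 0.7

    ann1_pos = (table[0][0] + table[1][0]) / n      # Ann. 1 says positive: 30 / 50 = 0.6
    ann2_pos = (table[0][0] + table[0][1]) / n      # Ann. 2 says positive: 25 / 50 = 0.5
    p_e = ann1_pos * ann2_pos + (1 - ann1_pos) * (1 - ann2_pos)   # 0.3 + 0.2 = 0.5

    kappa = (p_o - p_e) / (1 - p_e)
    print(round(p_o, 2), round(p_e, 2), round(kappa, 2))          # 0.7 0.5 0.4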

28 Inter-Annotator Agreement 6
How do we interpret the Kappa Coefficient? Cohen’s Kappa Coefficient is one of many inter-annotator agreement measures, each of which is suitable for a specific situation. A common interpretation scale:
K < 0: poor agreement
0.01 – 0.20: slight agreement
0.21 – 0.40: fair agreement
0.41 – 0.60: moderate agreement
0.61 – 0.80: substantial agreement
0.81 – 1.00: almost perfect agreement
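A tiny helper function (a sketch, not part of any slide or tool) that maps a Kappa value onto the verbal bands above:

    def interpret_kappa(k):
        """Map a Kappa value to the interpretation scale above."""
        if k < 0:
            return "poor agreement"
        if k <= 0.20:
            return "slight agreement"
        if k <= 0.40:
            return "fair agreement"
        if k <= 0.60:
            return "moderate agreement"
        if k <= 0.80:
            return "substantial agreement"
        return "almost perfect agreement"

    print(interpret_kappa(0.4))    # fair agreement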

29 Quiz True or False?
Stemmers identify affixes.
All POS taggers use the same tagset.
Manual annotation uses computer software.
Part of speech taggers work at the word level.
Automatic annotation relies on human efforts.
Syntactic parsers analyze phrases and sentences.
Crowdsourcing is cheaper than in-lab annotation.
With crowdsourcing, you can select expert annotators.
Scammers are one main disadvantage of in-lab annotation.
Crowdsourcing might be less accurate than in-lab annotation.
Percent agreement takes into consideration chance agreement.
Corpus annotation is adding linguistic information to the corpus texts.

30 Quiz True or False?
It is a must to use a corpus annotation tool.
Majority vote is one way to settle annotation disagreement.
A high inter-annotator agreement rate indicates reliable annotation.
Annotation guidelines are designed by the researcher, not the annotators.

