Topics in Linguistics ENG 331
Rania Al-Sabbagh
Department of English, Faculty of Al-Alsun (Languages), Ain Shams University
Week 10
Raw Corpora – Limited Benefits
Raw corpora are of limited use. The best we can do with a raw corpus is to get a wordlist (what do we use wordlists for?) or to run some complex searches using RegEx (give examples!). However, to make the most of a corpus, we may need to annotate it, that is, to add linguistic information to the texts. Linguistic information can be of any type: phonetic, phonological, morphological, syntactic, etc.
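For illustration, here is a minimal Python sketch of what we can already do with a raw corpus: building a frequency wordlist with a simple RegEx tokenizer. The file name corpus.txt is just a placeholder, not something from the slides.

    import re
    from collections import Counter

    # Read a raw (unannotated) corpus from a plain-text file.
    # "corpus.txt" is a hypothetical file name used for illustration.
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # A simple RegEx tokenizer: runs of letters, optionally with an internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)

    # The wordlist: every word type with its frequency, most frequent first.
    wordlist = Counter(tokens)
    for word, freq in wordlist.most_common(20):
        print(word, freq)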
Automatic Corpus Annotation
Corpus annotation is a very tedious process. Imagine how much time you would need to manually tag each word for its grammatical class in a corpus of 1M tokens. It is always good to look for automatic software to do the analysis. This is referred to as automatic corpus annotation. Some of the automatic software used for corpus annotation are:
Part-of-Speech (POS) taggers
Morphological analyzers
Syntactic parsers
Semantic role labelers
Part of Speech Taggers 1
Part-of-Speech (POS) taggers are computer software that automatically label each word in context for its grammatical class. In context means that the word must be within a phrase, a clause, or a sentence. There are many POS taggers on the market, especially for English. Two of the widely used ones are:
CLAWS
TreeTagger
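To see what a POS tagger does in practice, here is a minimal sketch using NLTK's built-in tagger (not CLAWS or TreeTagger, just an easy-to-run stand-in for the same idea):

    import nltk

    # One-time downloads for the tokenizer and the default tagger models:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    sentence = "I have two dogs."
    tokens = nltk.word_tokenize(sentence)

    # Tagging happens in context: the whole token sequence is passed at once.
    tagged = nltk.pos_tag(tokens)   # Penn Treebank tagset
    print(tagged)
    # Expected output along these lines:
    # [('I', 'PRP'), ('have', 'VBP'), ('two', 'CD'), ('dogs', 'NNS'), ('.', '.')]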
Part of Speech Taggers 2
Input: I have two dogs.
CLAWS output (inline): I_PNP have_VHB two_CRD dogs_NN2 ._PUN
TreeTagger output (stand-off):
I	PP	I
have	VHP	have
two	CD	two
dogs	NNS	dog
.	SENT	.
The differences you see are not only in the formatting (i.e., inline vs. stand-off), but also in the content (i.e., the tagset).
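The two formats carry the same word-tag information and can be converted into one another. A minimal sketch, assuming the inline word_TAG convention shown above:

    # Inline format, as in the CLAWS output: word_TAG pairs separated by spaces.
    inline = "I_PNP have_VHB two_CRD dogs_NN2 ._PUN"

    # Split each word_TAG pair into a (word, tag) tuple.
    pairs = [token.rsplit("_", 1) for token in inline.split()]

    # Print the same material in a stand-off style layout: one token per line.
    for word, tag in pairs:
        print(f"{word}\t{tag}")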
Part of Speech Taggers 3
If there are many POS taggers out there, how can we select one?
Performance against a manually-annotated random sample.
The tagset, and whether it gives me the information I need.
The tagset is the set of labels used to mark the grammatical class of each word. Some POS taggers have a very concise tagset. For example, the CATiB POS tagger for Arabic has only 6 labels: NOM, PROP, VRB, VRB-PASS, PRT, PNX. If you are interested in studying prepositions, this tagset may not be good for you.
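A minimal sketch of the first criterion: measuring a tagger's performance against a manually-annotated random sample. The gold tags below are invented for illustration, and NLTK's default tagger stands in for whichever tagger is being evaluated.

    import nltk

    # A manually-annotated random sample (the gold standard); invented example.
    gold = [("I", "PRP"), ("have", "VBP"), ("two", "CD"), ("dogs", "NNS"), (".", ".")]

    # Run the automatic tagger on the same tokens.
    tokens = [word for word, _ in gold]
    auto = nltk.pos_tag(tokens)

    # Accuracy = proportion of tokens whose automatic tag matches the manual tag.
    correct = sum(1 for (_, g), (_, a) in zip(gold, auto) if g == a)
    print(f"Accuracy: {correct / len(gold):.2%}")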
Morphological Analyzers
Morphological analyzers – sometimes referred to as stemmers – split off affixes. One well-known example for English is the Porter Stemmer. Morphological analysis is especially important for morphologically rich languages such as Arabic. Can you guess why? A well-known morphological analyzer for Arabic is MADAMIRA.
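A minimal sketch of the Porter Stemmer as implemented in NLTK, just to show what splitting off affixes looks like:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # The stemmer strips affixes, reducing inflected forms to a common stem.
    for word in ["dogs", "running", "studies"]:
        print(word, "->", stemmer.stem(word))
    # Expected output along these lines: dogs -> dog, running -> run, studies -> studi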
Syntactic Parsers
Syntactic parsers are computer software that label phrases for their syntactic types. Examples of well-known syntactic parsers for English are:
LINK Grammar
Stanford Parser
ARK
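To see what labeling phrases for their syntactic types means, here is a minimal sketch with a toy context-free grammar in NLTK; real parsers such as the Stanford Parser work from far larger grammars learned from treebanks, so this is only an illustration of the output, not of how those tools work.

    import nltk

    # A toy grammar covering exactly one sentence, written for illustration only.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> PRP | CD NNS
        VP -> VBP NP
        PRP -> 'I'
        VBP -> 'have'
        CD -> 'two'
        NNS -> 'dogs'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["I", "have", "two", "dogs"]):
        tree.pretty_print()   # labels each phrase: S, NP, VP, ...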
Semantic Role Labeling
Semantic role labeling analyzes the role of each word in the sentence as a subject, an object, an adjunct, etc. One example of a semantic role labeler for English is:
Manual Corpus Annotation
Unfortunately, not all levels of linguistic analysis can be done automatically. For example, to the best of my knowledge, there is no software that can extract metaphors or identify the type of speech act. Consequently, we have to go for the manual annotation option, where human beings are hired to read the corpus word for word and do the analysis we want them to do. Although manual annotation is very expensive and time consuming, it is typically more accurate than automatic annotation.
Manual Annotation Modes
There are two modes for manual annotation:
In-lab: hire expert human annotators, train them well, and have them working in a controlled environment.
Crowdsourcing: post a survey online and have people from around the world do the analysis for you. Two of the most famous platforms that host crowdsourced annotation are Amazon Mechanical Turk and Crowdflower.
Do you think crowdsourcing works for all levels of language analysis? What problems can we have with this mode of annotation? Can we avoid them?
In-Lab vs. Crowdsourcing
The pros and cons of each mode are as follows:

    Crowdsourcing                 In-Lab
    Cheap                         Expensive
    Non-expert annotators         Expert annotators
    Typically for simple tasks    Works for all tasks
    Many scammers                 Controlled environment
Best Practices
For in-lab annotation, the researcher needs to prepare well-written annotation guidelines. All ad hoc decisions need to be written into the guidelines. The annotators should be trained on a sample corpus before working individually.
For crowdsourcing, the researcher can have an admission test at the beginning of the online survey, or repeat questions randomly throughout the survey.
How to handle disagreement? For in-lab annotation, if there are 3+ annotators you can have a majority vote; otherwise, the researcher can settle the disagreements him/herself or exclude the disagreement instances. For crowdsourcing annotation, get as many survey takers as possible to get majority votes.
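A minimal sketch of the majority-vote idea for handling disagreement, assuming three or more annotators have labeled the same items; the items and labels below are invented for illustration.

    from collections import Counter

    # Labels from three annotators for the same items (invented example).
    annotations = {
        "item1": ["metaphor", "metaphor", "literal"],
        "item2": ["literal", "literal", "literal"],
        "item3": ["metaphor", "literal", "metaphor"],
    }

    for item, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:
            print(item, "->", label)   # clear majority vote
        else:
            print(item, "-> no majority; settle manually or exclude the instance")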
Quiz
True or False?
Stemmers identify affixes.
All POS taggers use the same tagset.
Manual annotation uses computer software.
Part-of-speech taggers work at the word level.
Automatic annotation relies on human efforts.
Syntactic parsers analyze phrases and sentences.
Crowdsourcing is cheaper than in-lab annotation.
With crowdsourcing, you can select expert annotators.
Crowdsourcing might be less accurate than in-lab annotation.
Corpus annotation is adding linguistic information to the corpus texts.