Published by Alannah Hardy. Modified over 8 years ago.
1
Categories and annotation
Corpus annotation: the process of adding information to a corpus.
Types of annotation: tagging, parsing, annotation of anaphora, and semantic annotation.
The use of annotation is a “category-based” methodology: words, phonological units and clauses are assigned to categories, and the categories are the basis for corpus searches and statistical manipulation.
2
The idea of annotation:
(1) adding value to a corpus
(2) making it easier to retrieve information
(3) increasing the range of investigation
3
Tagging
Tagging: allocating a part-of-speech (POS) label to each word in a corpus.
The tag can be chosen to give general or specific information.
Ex: light can be tagged as a verb, a noun or an adjective.
4
Ex: the word-form being in “is being considered” can be tagged at increasing levels of specificity:
- verb
- present participle of a verb
- present participle of the verb be
- present participle of the verb be used as an auxiliary
5
The various uses of a tagged corpus
- It is easier to look at the concordance lines for a word with several senses.
- Different parts of speech for a specific word can be compared.
- It is easier to observe the collocations of a word.
- The total occurrence of word-classes in a particular corpus can be compared.
- The frequency of sequences of tags can be calculated, and corpora can be compared.
6
It is easier to look at the concordance lines for a word with several senses.
Ex: LIGHT
7
The task would be easier if the search could specify that only verb instances of LIGHT are wanted.
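A search restricted by tag can be sketched as follows. This is a minimal illustration, not a real concordancer: the tiny tagged corpus, the tag names and the `concordance` function are all invented for the example.

```python
# Toy tagged corpus: a flat list of (word, tag) pairs. Words and tags are invented.
TAGGED = [
    ("she", "PRON"), ("stopped", "VERB"), ("to", "PART"), ("light", "VERB"),
    ("a", "DET"), ("cigarette", "NOUN"), ("under", "ADP"), ("the", "DET"),
    ("red", "ADJ"), ("light", "NOUN"), ("in", "ADP"), ("a", "DET"),
    ("light", "ADJ"), ("brown", "ADJ"), ("coat", "NOUN"),
]

# Word-forms of the lemma LIGHT that the search should match.
LIGHT_FORMS = {"light", "lights", "lit", "lighting", "lighted"}


def concordance(tagged, forms, tag=None, span=3):
    """Return keyword-in-context lines for `forms`, optionally limited to one tag."""
    lines = []
    for i, (word, word_tag) in enumerate(tagged):
        if word.lower() not in forms:
            continue
        if tag is not None and word_tag != tag:
            continue
        left = " ".join(w for w, _ in tagged[max(0, i - span):i])
        right = " ".join(w for w, _ in tagged[i + 1:i + 1 + span])
        lines.append(f"{left} [{word}/{word_tag}] {right}")
    return lines


all_lines = concordance(TAGGED, LIGHT_FORMS)               # every instance of LIGHT
verb_lines = concordance(TAGGED, LIGHT_FORMS, tag="VERB")  # verb instances only
for line in verb_lines:
    print(line)
```

Without the `tag` argument all three instances of light are returned; with `tag="VERB"` only the verb instance remains.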
8
Different parts of speech for a specific word can be compared
Ex: Biber et al. (1998), DEAL
- The form deal is more frequently a noun than a verb.
- The form deals is more frequently a verb than a noun.
The forms of a lemma do not always behave in the same way.
9
It is easier to observe the collocations of a word
Ex: LIGHT
- LIGHT (v): cigarette, came, fire, candle, candles
- LIGHT (n): red, green, bright, traffic, flashing
- LIGHT (adj): dark, brown, blue, touch, very
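Counting collocates separately for each part of speech might look like the sketch below. The word stream and its tags are invented, the `collocates` helper is hypothetical, and the window ignores sentence boundaries for simplicity.

```python
from collections import Counter

# Invented toy data: a flat stream of (word, tag) pairs.
TAGGED = [
    ("light", "VERB"), ("the", "DET"), ("candle", "NOUN"),
    ("at", "ADP"), ("the", "DET"), ("traffic", "NOUN"), ("light", "NOUN"),
    ("a", "DET"), ("red", "ADJ"), ("light", "NOUN"), ("flashing", "VERB"),
    ("a", "DET"), ("light", "ADJ"), ("blue", "ADJ"), ("shirt", "NOUN"),
]


def collocates(tagged, form, tag, span=2):
    """Count the words within `span` tokens of `form` when it carries `tag`."""
    counts = Counter()
    for i, (word, word_tag) in enumerate(tagged):
        if word == form and word_tag == tag:
            window = tagged[max(0, i - span):i] + tagged[i + 1:i + 1 + span]
            counts.update(w for w, _ in window)
    return counts


verb_collocates = collocates(TAGGED, "light", "VERB")
noun_collocates = collocates(TAGGED, "light", "NOUN")
adj_collocates = collocates(TAGGED, "light", "ADJ")
print(noun_collocates.most_common(3))
```

A real study would use a much larger corpus and would usually filter out function words such as the and a before ranking the collocates.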
10
The total occurrence of word-classes in a particular corpus can be compared
Ex: Biber et al. (1999)
11
The frequency of sequences of tags can be calculated and corpora can be compared
Aarts and Granger (1998): Dutch, Finnish and French speakers writing in English use fewer of these tag sequences than native-speaker writers do:
- preposition-article-noun (e.g. in the morning)
- article-noun-preposition (e.g. a debate on)
- noun-preposition-noun (e.g. part of speech)
- noun-preposition-article (e.g. concern for the)
Conclusion: non-native writers do not use prepositions in a “native-like” way.
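The calculation behind this kind of comparison can be sketched as relative frequencies of three-tag sequences. The two tag streams below are invented placeholders, not data from Aarts and Granger, and `tag_trigram_rates` is a hypothetical helper.

```python
from collections import Counter


def tag_trigram_rates(tags):
    """Return the relative frequency of each three-tag sequence in `tags`."""
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    counts = Counter(trigrams)
    total = len(trigrams)
    return {sequence: n / total for sequence, n in counts.items()}


# Invented tag streams standing in for two tagged corpora.
corpus_a = "PREP ART NOUN VERB ART NOUN PREP ART NOUN".split()
corpus_b = "PRON VERB ART NOUN VERB ADV PREP ART NOUN".split()

rates_a = tag_trigram_rates(corpus_a)
rates_b = tag_trigram_rates(corpus_b)

target = ("PREP", "ART", "NOUN")  # e.g. "in the morning"
print(rates_a.get(target, 0.0), rates_b.get(target, 0.0))
```

Using relative rather than raw frequencies lets corpora of different sizes be compared on the same footing.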
12
Tagging principles
- rules governing word-classes
- probability
13
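Rules governing word-classes
Ex: a + light: after the article a, light can be a noun or an adjective, but not a verb.
Probability
Where the rules fail to decide (deal: noun or verb?), the more probable tag is chosen: deal occurs more frequently as a noun than as a verb, so it is tagged as a noun.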
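The interplay of the two principles can be sketched as a toy tagger: a rule first removes impossible tags, then the most frequent remaining tag is chosen. The lexicon, the counts in it and the single rule are invented for illustration and do not describe any real tagger.

```python
# Hypothetical lexicon: possible tags for each word with invented frequency counts.
LEXICON = {
    "a": {"ART": 100},
    "they": {"PRON": 100},
    "light": {"VERB": 40, "NOUN": 35, "ADJ": 25},
    "deal": {"NOUN": 70, "VERB": 30},
}


def tag(words):
    """Tag `words` using one word-class rule, then the most probable tag."""
    tagged = []
    for word in words:
        # Assumption: unknown words default to NOUN.
        candidates = dict(LEXICON.get(word, {"NOUN": 1}))
        previous_tag = tagged[-1][1] if tagged else None
        # Rule: directly after an article, the verb reading is ruled out.
        if previous_tag == "ART" and len(candidates) > 1:
            candidates.pop("VERB", None)
        # Probability: pick the most frequent of the remaining candidates.
        best = max(candidates, key=candidates.get)
        tagged.append((word, best))
    return tagged


print(tag(["a", "light"]))    # the rule removes VERB, so NOUN is chosen
print(tag(["they", "deal"]))  # no rule applies, so probability alone decides
```

In the second call probability alone tags deal as a noun even though it is a verb there, which is the kind of error a purely frequency-based choice can produce.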
14
Problems with automatic taggers
Taggers may be wrong, particularly when a word is used in an unusual way.
Ex: sleaze in the Bank of English. The tagger did not know that sleaze can be a verb, so sleaze in the phrase “trying to sleaze his way into our hearts” is tagged as a noun.
The inaccuracies are not usually spread evenly throughout a corpus.
15
Ex: unambiguous words such as the (det) and hamster (noun)
A tagger with an accuracy rate of, say, 96% may be 100% accurate for many words but only 70% accurate for some words.
The inaccuracies will affect the reliability of statistics.
Possible remedies:
- Correct the tags manually once the automatic tagger has finished.
- Instruct the tagger to suggest more than one tag.
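The gap between overall accuracy and accuracy for individual words can be checked by comparing tagger output with hand-corrected tags. The sketch below uses invented data and a hypothetical `accuracy_by_word` helper.

```python
from collections import defaultdict


def accuracy_by_word(gold, predicted):
    """Return (overall accuracy, per-word accuracy) for two aligned tag lists."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for (word, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        totals[word] += 1
        hits[word] += gold_tag == predicted_tag  # True counts as 1
    per_word = {word: hits[word] / totals[word] for word in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_word


# Invented example: the tagger always gets "the" right but tags every "light" as NOUN.
gold = [("the", "DET")] * 6 + [
    ("light", "VERB"), ("light", "NOUN"), ("light", "NOUN"), ("light", "ADJ"),
]
predicted = [("the", "DET")] * 6 + [("light", "NOUN")] * 4

overall, per_word = accuracy_by_word(gold, predicted)
print(overall, per_word)
```

The overall figure hides the fact that all the errors fall on one word, so any statistic about that word is far less reliable than the headline accuracy suggests.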
16
Parsing
Parsing: analysing the sentences in a corpus into their constituent parts, that is, doing a grammatical analysis.
Ex: [ The victim’s friends N ] [ told [police N ] [that [Krueger N ] [drove [into [the quarry N ] P ] N ] V ] V ].
The parser identifies the boundaries of sentences, clauses and phrases.
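Labelled brackets of this kind can be read back into a list of constituents with a simple stack. The sketch assumes the convention visible in the example, where the label is the last token before each closing bracket, and that brackets are separated from words by spaces; `constituents` is a hypothetical helper, and only the fragment “into the quarry” from the example is used.

```python
def constituents(bracketed):
    """Return (label, words) for each bracketed constituent, innermost first."""
    stack = []  # one token list per bracket that is currently open
    found = []
    for token in bracketed.split():
        if token == "[":
            stack.append([])
        elif token == "]":
            tokens = stack.pop()
            label, words = tokens[-1], tokens[:-1]
            found.append((label, " ".join(words)))
            if stack:
                stack[-1].extend(words)  # the parent phrase contains these words too
        elif stack:
            stack[-1].append(token)
    return found


phrases = constituents("[ into [ the quarry N ] P ]")
for label, words in phrases:
    print(label, words)
```

Each opening bracket starts a new phrase and each closing bracket ends the most recently opened one, which is how phrase boundaries nested inside one another are recovered.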
17
Parsers
It is difficult to make automatic parsers fully accurate.
Parsing by hand (manual parsing):
- gives a greater degree of accuracy
- is time-consuming, so only small corpora can be parsed
Ex: the Polytechnic of Wales (PoW) corpus
18
The use of parsed corpora
Parsed corpora are the basis for the statistical work that has been done on different registers.
Ex: Biber et al. (1998): BEGIN and START in the fiction and academic prose sub-corpora
19
Patterns of BEGIN/START: intransitive; + noun group; + to-clause; + “-ing” clause
Results:
- The intransitive use of START accounts for 64% of instances in academic prose.
- BEGIN + to-clause accounts for 72% in fiction; BEGIN + “-ing” clause accounts for only 4% in fiction.
20
Explanations:
- Intransitive START is frequent in academic prose because it indicates the start of a process.
- BEGIN + to-clause is used in fiction to describe the start of an action or a reaction to events, as is typical of narrative.
21
BEGIN + to-clause > BEGIN + “-ing” clause, by collocations:
- Both patterns: verbs which indicate movement (move, walk, fall, run; moving, walking, falling, running)
- To-clause only: verbs which indicate thought and feeling (feel, think, wonder, realise)
- Where both patterns occur, the to-clause expression is always more frequent than the “-ing” expression.
22
Using parsed corpora to teach grammatical analysis to students
McEnery et al. (1997): students who have practised analysis with a parsed corpus do better than equivalent students who have been taught by a human teacher.
23
Other kinds of annotation
- Annotation of anaphora
- Semantic annotation
- How meanings are made
24
Annotation of anaphora
- Anaphora (Halliday and Hasan 1976): the use of words and phrases in a text to refer to preceding or subsequent words and phrases
- Anaphor: a cohesive item used to summarise or label, thus playing a role in the organisation of text
25
Garside et al. (1997): most anaphora annotation systems do some or all of the following:
- identify an anaphor and its antecedent, or establish whether an antecedent is identifiable or not
- categorise the antecedent
- identify the direction of connection
- identify the type of anaphor
- note the distance between an anaphor and its antecedent
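One way to picture what such an annotation records is a small data structure with one field per item in the list. The class below is a hypothetical sketch, not the scheme of Garside et al.; the field names, the label values and the annotated example are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnaphorLink:
    """One annotated anaphor; positions are token indexes in the text."""
    anaphor: str
    anaphor_index: int
    anaphor_type: str                          # e.g. "pronoun" (label set is assumed)
    antecedent: Optional[str] = None           # None when no antecedent is identifiable
    antecedent_index: Optional[int] = None
    antecedent_category: Optional[str] = None  # e.g. "person" (label set is assumed)

    @property
    def identifiable(self) -> bool:
        return self.antecedent_index is not None

    @property
    def direction(self) -> Optional[str]:
        """'backward' if the antecedent precedes the anaphor, else 'forward'."""
        if not self.identifiable:
            return None
        return "backward" if self.antecedent_index < self.anaphor_index else "forward"

    @property
    def distance(self) -> Optional[int]:
        """Number of tokens between the anaphor and its antecedent."""
        if not self.identifiable:
            return None
        return abs(self.anaphor_index - self.antecedent_index)


tokens = "Joanna stubbed out her cigarette".split()
link = AnaphorLink("her", tokens.index("her"), "pronoun",
                   "Joanna", tokens.index("Joanna"), "person")
print(link.direction, link.distance)
```

Direction and distance are derived from the two positions rather than typed in by the annotator, which removes one source of inconsistency in what is otherwise manual work.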
26
The disadvantages of the annotation of anaphora
- It cannot be done automatically.
- The amount of text that can be coded this way is limited.
27
Semantic annotation
Thomas and Wilson (1996, 1997): semantic annotation refers to the categorisation of words and phrases in a corpus in terms of a set of semantic fields.
28
Ex: Joanna stubbed out her cigarette with unnecessary fierceness.
- Joanna: “Personal Name”
- stubbed out: “Object-Oriented Physical Activity” and “temperature”
- cigarette: “Luxury item”
- unnecessary: “Causality/Chance”
- fierceness: “Anger”
- her and with: “Low Content Words”, which are not assigned to a semantic category
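A lexicon lookup gives a rough idea of how such categories could be attached to a text. The sketch below is not the system used by Thomas and Wilson: the lexicon contains only the items from the example sentence, and the `annotate` function and the “Unmatched” label are assumptions.

```python
# Semantic fields taken from the example sentence; a real lexicon would be far larger.
SEMANTIC_FIELDS = {
    "joanna": ["Personal Name"],
    "stubbed out": ["Object-Oriented Physical Activity", "temperature"],
    "cigarette": ["Luxury item"],
    "unnecessary": ["Causality/Chance"],
    "fierceness": ["Anger"],
}
LOW_CONTENT_WORDS = {"her", "with"}  # left without a semantic category


def annotate(sentence):
    """Return (unit, fields) pairs, matching two-word units before single words."""
    tokens = sentence.lower().rstrip(".").split()
    annotated = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in SEMANTIC_FIELDS:
            annotated.append((pair, SEMANTIC_FIELDS[pair]))
            i += 2
        elif tokens[i] in LOW_CONTENT_WORDS:
            annotated.append((tokens[i], []))
            i += 1
        else:
            annotated.append((tokens[i], SEMANTIC_FIELDS.get(tokens[i], ["Unmatched"])))
            i += 1
    return annotated


result = annotate("Joanna stubbed out her cigarette with unnecessary fierceness.")
for unit, fields in result:
    print(unit, fields)
```

Matching the two-word unit first keeps stubbed out together as one item instead of splitting it into two unrelated words.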
29
Thomas and Wilson (1996): an analysis of interactions between doctors and patients in two clinics
Doctor A vs. Doctor B
Doctor A:
- more interactive and friendly
- more discourse particles, first and second person pronouns and boosters
- more categories of “Start”, “Cause”, “Change”, “Power” and “Treatment”
30
- spending more time explaining treatment
Doctor B:
- using more technical terms
- spending time explaining how the disease was progressing
Patients prefer and are more supportive of Doctor A.
The two doctors differ in their semantic profiles.
31
“How a meaning is made”
Biber and Finegan (1989) and Conrad and Biber (2000):
- a variation on semantic annotation
- partial annotation, because only certain categories are selected
Ex: “stance”
- adverbs: probably
- clauses: I think
- prepositional phrases: on the whole
32
Conrad and Biber (2000)
- Adverbs are the most frequently used way of expressing stance grammatically in conversation, news reporting and academic prose.
- Clauses (e.g. I think, I guess) are also frequent in conversation.
- Prepositional phrases are extensively used in academic prose and news reporting.
These findings can be linked with approaches to language teaching.
33
Issues in annotation
Three basic methods of annotating a corpus: manual, computer-assisted and automatic
Automatic annotation:
- the computer works alone, following whatever rules the programmer has determined
- there are likely to be errors
small corpora
34
Computer-assisted annotation:
- a human researcher edits the computer-generated output, so the result is more accurate
- it is slower, so less corpus material can be annotated
35
- Large tagged corpora are easily available, but parsed corpus data is limited.
- A heavily annotated, parsed corpus is very valuable.
- The work involved in annotation acts as a constraint against updating or enlarging a corpus. Ex: LOB corpus
36
Competing methods
Ad hoc annotation
Ex: Sealey (2000) annotated a corpus to identify all references to children, whatever the lexical item used (child, kid, young person, etc.). She then gathered all of these instances and observed similarities in patterning.
37
The author’s hope for corpus analysis
- a synergy between word-based and category-based methods
- much as qualitative and quantitative methods of research complement each other
38
Suggestions in Conrad (2000)
- Researchers must go “beyond the concordance line” in exploiting corpora.
- Corpus studies need to progress beyond making ad hoc observations about individual words.
39
Points about corpus annotation
- Be able to see the plain text.
- Use unconventional, ad hoc annotation as necessary.
- Make the process of annotation as automatic as possible.
- The corpus needs to be large enough.
- Be aware of the methods being used when reading about corpus work.
40
Thank you